Pythons difflib SequenceMatcher speed up

2024/9/21 3:13:00

I'm using difflib SequenceMatcher (ratio() method) to define similarity between text files. While difflib is relatively fast to compare a small set of text files e.g. 10 files of 70 kb on average comparing to each other (46 comparisons) takes about 80 seconds.

The issue here is that i have a collection of 3000 txt files (75 kb on average), a raw estimation on how much time SequenceMatcher needs to complete the comparison job is 80 days!

I tried "real_quick_ratio()" and "quick_ratio()" methods, but they don't fit to our needs.

Is there any way to speed up the comparison process? If not, is there any other faster method to do such a task? Even if it is not in Python.

Answer

The issue you're finding is very common, since difflib is not optimized. Here are some tricks I've found over the years while developing a tool that compares HTML documents.

Files fit in memory

Create two lists, containing the lines from each file. Then call difflib.SequenceMatcher with the lists as parameters. The SequenceMatcher knows how to handle lists, and the process will be much faster since it is done on a line by line basis instead of char by char. This might reduce the precision.

Take a look at fuzzy_string_cmp.py and diff.py to see how I'm doing exactly this.

Alternative

There is a great library called diff_match_patch which is available in pypi. The library will perform fast diffs between two strings and return the changes (line added, line equal, line removed).

By leveraging diff_match_patch you should be able to create your own dmp_quick_ratio function.

In diff.py you can see how I'm using the library to get inspiration for creating dmp_quick_ratio.

My tests showed that using diff_match_patch was 20 times faster than Python's difflib.

https://en.xdnf.cn/q/72270.html

Related Q&A

create an asymmetric colormap

I am creating a colormap to map colors in a folium choropleth map, using code from here:from branca.colormap import linearcolormap = linear.RdBu.scale(df.MyValue.min(),df.MyValue.max())colormapAs you c…

NLTK - Get and Simplify List of Tags

Im using the Brown Corpus. I want some way to print out all the possible tags and their names (not just tag abbreviations). There are also quite a few tags, is there a way to simplify the tags? By sim…

PolynomialFeatures object has no attribute predict

I want to apply k-fold cross validation on the following regression models:Linear Regression Polynomial Regression Support Vector Regression Decision Tree Regression Random Forest RegressionI am able t…

Error module object has no attribute freetype

I am using this code Link but it displays error of module object has no attribute i tried to pip install freetype but nothing happened. Can anyone please guide me with this.import cv2 import numpy as …

Count total number of white pixels in an image

I am trying to count total number of white pixels in the following image:But with my code, I get this errorsrc is not a numpy array, neither a scalar.This is my code: img=cv2.imread(filename,1) TP= wid…

Pass a JSON object to an url with requests

So, I want to use Kenneth excellent requests module. Stumbled up this problem while trying to use the Freebase API.Basically, their API looks like that:https://www.googleapis.com/freebase/v1/mqlread?q…

jenkinsapi python - how to trigger and track the job result

I am using JenkinsAPI to trigger parametrized jobs. I am aware of the REST API that Jenkins use, but our setup does not allow that directly; so the main mean for me to trigger jobs is through this libr…

Django test parallel AppRegistryNotReady

I am trying to understand how to run django tests in parallel with in memory sqlite3.I have django app with that structure:gbookorder...tests__init__.pytest_a1.pytest_b1.pyutils.pytest_a1.py and test_b…

ImportError: PyCapsule_Import could not import module pyexpat

I am using Jenkins to build a python (Flask) solution to deploy to Google App Engine. As part of the build process I run a few integration tests. One of them is failing with the following error. ERROR:…

Python - Get max value in a list of dict

I have a dataset with this structure :In[17]: allIndices Out[17]: [{0: 0, 1: 1.4589, 4: 2.4879}, {0: 1.4589, 1: 0, 2: 2.1547}, {1: 2.1547, 2: 0, 3: 4.2114}, {2: 4.2114, 3: 0}, {0: 2.4879, 4: 0}]Id lik…