Super fast way to compare if two strings are equal

2024/9/22 3:46:34

Obviously, in Python you can check whether two strings are equal with:

"hello word" == "hello world"

But what if you are comparing really long strings (in excess of 1M characters)? Is there a built-in way, or any library in Python, that can do this much faster; perhaps utilizing the Karp–Rabin algorithm or something similar?

Or, under the hood, is stringA == stringB actually the fastest method?

Answer


Consider how str == str is implemented in CPython: it first performs an identity check and a length check, and only then compares the contents element by element. This is quite fast, and understandably so, since a lot of Python code relies on it. In the average case there is no need for further optimization, because two arbitrary strings will typically differ early on.
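The three paths can be observed directly (a minimal sketch; the labels describe the CPython behavior outlined above):

```python
a = "x" * 1_000_000
b = a              # same object: the identity check short-circuits
c = "y" + a[1:]    # differs at index 0: the scan stops immediately
d = a[:-1] + "y"   # differs only at the last character: near worst case

assert a == b          # fast: a is b
assert not (a == c)    # fast: first element differs
assert not (a == d)    # slowest: scans almost the whole string
```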

However, there are two use cases where there is some room for optimization:

  • you have some partial information on how the two inputs are going to be different.
  • you perform multiple comparisons among a certain set of elements (see @wim comments).

An example of the first situation: if you know that when two inputs of the same size are different, they are likely to differ over at least m contiguous elements, then comparing only every k-th element, with k < m, is a reasonable bet, e.g.:

def is_k_equal(a, b, k=4096):
    if k in {0, 1}:
        return a == b
    else:
        return a[::k] == b[::k]

def is_equal_partial(a, b, partial_match=is_k_equal):
    return len(a) == len(b) and partial_match(a, b) and a == b

An example of the second situation: if you want to know which p inputs out of q are pairwise equal, it may be beneficial to compute a hash of your inputs (for example with hash(), but other options may be equally valid) and only perform a full comparison when the hashes match. It goes without saying that if your hash function has a high collision rate, it may just introduce additional overhead (see Wikipedia for information on hashing). The hashes of the inputs could be managed manually, or you could guard your full comparison with a hash comparison in an is_equal() function, e.g.:

def is_equal_hashing(a, b, hashing=hash):
    return len(a) == len(b) and hashing(a) == hashing(b) and a == b

provided that your hashing function is memoized. For hash() you do not need to do anything else, as its result is already cached for these inputs. If you were to use a fancier hash (e.g. crc32, md5, etc.), you may need to add memoization yourself, e.g. with @functools.lru_cache. The condition for this use case to benefit from this approach (ignoring the time required to compare the hashes, which is usually much faster than the other operations involved) is:
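As a sketch of what that memoization could look like (crc32 is just an illustrative choice here, and crc32_memo / is_equal_crc are hypothetical names):

```python
import zlib
import functools

@functools.lru_cache(maxsize=None)
def crc32_memo(data):
    # data must itself be hashable (e.g. bytes) for lru_cache to work
    return zlib.crc32(data)

def is_equal_crc(a, b):
    # cheap checks first; run the full comparison only when the hashes match
    return len(a) == len(b) and crc32_memo(a) == crc32_memo(b) and a == b
```

Repeated calls on the same input then reuse the cached checksum instead of rescanning the data.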

t_h * n_h < t_c * n_c

with t_h the initial hash computation time, n_h the number of unique hashes to compute, t_c the comparison time, and n_c the number of full comparisons which fail near the end of the inputs.
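Plugging in some purely hypothetical numbers (not measured anywhere) shows how the inequality is used:

```python
t_h = 1500.0          # µs to hash one input once (assumed)
n_h = 100             # unique inputs, each hashed once
t_c = 140.0           # µs for a full comparison failing near the end (assumed)
n_c = 100 * 99 // 2   # pairwise comparisons the hash guard could avoid

print(t_h * n_h < t_c * n_c)  # True: hashing pays off in this scenario
```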

When in doubt on how things will perform on your input, it is typically a good idea to measure / profile your code.

Care must be taken when timing memoized functions (like hash()), because, if you are interested in the performance of the non-memoized path, you cannot rely on timings of multiple repeated calls of the same input as it is typically done, for example with IPython's %timeit using default parameters. Instead, you may use %timeit -n1 -r1 for un-cached timings. The results would only be useful for order of magnitude estimates.
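Outside IPython, a single un-cached call can be timed with the standard library; this is a rough equivalent of %timeit -n1 -r1 (time_once is a hypothetical helper name):

```python
import timeit

def time_once(fn, *args):
    # One single call, so caching from earlier calls cannot skew the result;
    # useful only for order-of-magnitude estimates.
    return timeit.timeit(lambda: fn(*args), number=1)

elapsed = time_once(hash, "x" * 1_000_000)  # seconds, as a float
```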

To give you some ideas on how fast the possible ingredients of your approach are, here are some micro-benchmarks:

import hashlib
import functools

def md5(data):
    return hashlib.md5(data).digest()

@functools.lru_cache(maxsize=16384)
def sha1(data):
    return hashlib.sha1(data).digest()

def sha256(data):
    return hashlib.sha256(data).digest()

def sha512(data):
    return hashlib.sha512(data).digest()
import numpy as np
import numba as nb

@nb.jit(fastmath=True)
def hash_sum_nb(data, init=0):
    dtype = np.uint64
    nbytes = 8
    n = len(data)
    offset = n % nbytes
    result = init
    if offset:
        body = np.frombuffer(data[:-offset], dtype=dtype)
        tail = np.frombuffer(data[-offset:], dtype=np.uint8)
        for x in tail:
            result += x
    else:
        body = np.frombuffer(data, dtype=dtype)
    for x in body:
        result += x
    return result + n
import zlib
import string
import random

n = 1_000_000
s = b''.join(string.printable[random.randrange(len(string.printable))].encode() for _ in range(n))

funcs = hash, hash, zlib.crc32, zlib.adler32, md5, sha1, sha1, sha256, sha512, hash_sum_nb
for func in funcs:
    result = %timeit -n1 -r1 -q -o func(s)
    print(f'{func.__name__:>12s}  {result.best * 1e6:.3f} µs')
#         hash  586.401 µs
#         hash  0.853 µs
#        crc32  976.128 µs
#      adler32  468.452 µs
#          md5  1790.659 µs
#         sha1  1362.948 µs
#         sha1  1.423 µs
#       sha256  1347.432 µs
#       sha512  1321.981 µs
#  hash_sum_nb  64.768 µs

cases = {
    'worst case': (s[:-1] + b'x', s[:-1] + b'y'),
    'best case*': (s[:-1], s[:-2]),
    'best case': (b'x' + s[:-1], b'y' + s[:-1]),
}
for name, (a, b) in cases.items():
    result = %timeit -n1 -r1 -q -o a == b
    print(f'str == str ({name:10s})  {result.best * 1e6:.3f} µs')
# str == str (worst case)  142.466 µs
# str == str (best case*)  0.856 µs
# str == str (best case )  1.012 µs

a, b = (s[:-1] + b'x', s[:-1] + b'y')
result = %timeit -n1 -r1 -q -o is_k_equal(a, b)
print(f'{is_k_equal.__name__:>12s}  {result.best * 1e6:.3f} µs')
# is_k_equal  10.037 µs

Note that both hash() and sha1() are called twice on the same input to show the effects of memoization.
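This caching is easy to observe with single-run timings: in CPython, str objects store their hash on the object after the first hash() call (bytes objects behaved the same way in the versions these benchmarks were run on), so the second call is roughly O(1):

```python
import timeit

s = "x" * 1_000_000
first = timeit.timeit(lambda: hash(s), number=1)   # computes the hash and caches it
second = timeit.timeit(lambda: hash(s), number=1)  # returns the cached value
```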

With this data (or similar numbers that you could produce on your input / system), it may be possible to craft a more performant string equality comparison. Note that throughout the answer I used bytes instead of str. The timings for str would generally be worse for most hashing functions, because of the additional overhead required to handle the encoding, with the notable exception of hash().

