Obviously, in Python to check whether two strings are equal you can do:
"hello word" == "hello world"
But what if you are comparing really long strings (in excess of 1M characters)? Is there a built-in way or any library in Python that can do this much faster, perhaps using the Karp–Rabin algorithm or something similar?
Or, under the hood, is stringA == stringB actually the fastest method?
Considering how str == str is implemented in Python, the comparison first performs an id() (identity) check and a length check, and only then proceeds element by element.
This is quite fast and understandably so, since a lot of Python code relies on this.
In the average case, there is no need for further optimization as arbitrary strings will be different quite early.
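As a quick, illustrative way of seeing those short-circuits (the variable names and sizes here are made up for the example, and the absolute timings depend on your machine):

```python
import timeit

a = 'x' * 1_000_000
b = 'x' * 1_000_000            # equal content, but a different object
c = 'x' * 999_999 + 'y'        # same length, differs only at the last character
d = 'x' * 500_000              # different length
e = 'y' + 'x' * 999_999        # differs at the very first character

print(timeit.timeit(lambda: a == a, number=1000))  # identity check, no scan at all
print(timeit.timeit(lambda: a == d, number=1000))  # length check, immediate False
print(timeit.timeit(lambda: a == e, number=1000))  # early mismatch, returns almost immediately
print(timeit.timeit(lambda: a == b, number=1000))  # full scan to confirm equality
print(timeit.timeit(lambda: a == c, number=1000))  # near-full scan to find the late mismatch
```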
However, there are two use cases where there is some room for optimization:
- you have some partial information on how the two inputs are going to be different.
- you perform multiple comparisons among a certain set of elements (see @wim's comments).
An example of the first situation: if you know that when two inputs of the same size differ, they are likely to differ over at least m contiguous elements, then comparing only every k-th element with k < m is a reasonable bet, e.g.:
```python
def is_k_equal(a, b, k=4096):
    if k in {0, 1}:
        return a == b
    else:
        return a[::k] == b[::k]


def is_equal_partial(a, b, partial_match=is_k_equal):
    return len(a) == len(b) and partial_match(a, b) and a == b
```
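As a rough usage sketch (the data here is made up so that the two inputs differ over a long contiguous run, which is exactly the scenario is_k_equal() targets):

```python
import random

n = 1_000_000
a = bytes(random.randrange(256) for _ in range(n))
a2 = a[:-1] + a[-1:]                                   # equal content, distinct object
b = a[:n // 2] + bytes(x ^ 0xFF for x in a[n // 2:])   # second half differs everywhere

print(is_equal_partial(a, a2))  # True: lengths, sparse samples and full check all match
print(is_equal_partial(a, b))   # False: the sparse sampling already hits a mismatch
```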
An example of the second situation: if you want to know which p inputs out of q are pairwise equal, it may be beneficial to compute a hash of your inputs (for example using hash(), but other options may be equally valid) and only perform a full comparison when the hashes match.
It goes without saying that if your hash has a high collision rate, it may just introduce additional overhead (see Wikipedia for information on hashing).
The hashes of the inputs could either be managed manually, or you could guard your full comparison with a hash comparison in an is_equal() function, e.g.:
```python
def is_equal_hashing(a, b, hashing=hash):
    return len(a) == len(b) and hashing(a) == hashing(b) and a == b
```
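A small, illustrative example of how the guard behaves when comparing a few long inputs pairwise (on CPython versions that cache the hash of bytes objects, only the first hash() call per object pays the full price):

```python
import itertools

inputs = [
    bytes(1_000_000),           # all zeros
    bytes(1_000_000),           # equal to the first, but a different object
    b'\x01' + bytes(999_999),   # same length, differs at the first byte
]

for x, y in itertools.combinations(inputs, 2):
    print(is_equal_hashing(x, y))
# True   -- hashes match, the full comparison confirms equality
# False  -- hashes (almost certainly) differ, so the full comparison is skipped
# False
```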
This guard only pays off if your hashing function is memoized. For hash() you do not need to do anything else, as it is already memoized for these inputs. If you were to use a fancier hash (e.g. crc32, md5, etc.), you may need to add the memoization yourself, e.g. with @functools.lru_cache.
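For instance, a minimal sketch of wrapping zlib.crc32 this way (crc32_cached and is_equal_crc are illustrative names, not library functions):

```python
import functools
import zlib


@functools.lru_cache(maxsize=16384)
def crc32_cached(data):
    return zlib.crc32(data)


def is_equal_crc(a, b):
    # Same guard as is_equal_hashing(), but with an explicitly memoized checksum.
    return len(a) == len(b) and crc32_cached(a) == crc32_cached(b) and a == b
```

Keep in mind that @functools.lru_cache itself hashes its argument for the cache lookup, so this only helps when that lookup is cheaper than recomputing the checksum, which is typically the case once hash() of the object has been cached.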
The condition for this use case to see benefits from this approach (ignoring the time required to compare the hashes, which is usually much faster than the other operations involved) is:

t_h * n_h < t_c * n_c

with t_h the initial hash computation time, n_h the number of unique hashes to compute, t_c the comparison time, and n_c the number of full comparisons which fail near the end of the inputs.
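As a rough, illustrative sanity check, plugging in the micro-benchmark numbers reported further below (your own measurements will differ):

```python
# Timings in µs, taken from the micro-benchmarks below.
t_h = 586.401   # first (non-memoized) hash() of a ~1M byte input
t_c = 142.466   # worst-case full comparison, failing at the very end
n_h = 100       # say, 100 unique inputs to hash

# Hashing pays off only if it avoids at least this many late-failing full comparisons:
print(t_h * n_h / t_c)   # ~411.6
```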
When in doubt on how things will perform on your input, it is typically a good idea to measure / profile your code.
Care must be taken when timing memoized functions (like hash()): if you are interested in the performance of the non-memoized path, you cannot rely on timing multiple repeated calls on the same input, as is typically done, for example, with IPython's %timeit and its default parameters. Instead, you may use %timeit -n1 -r1 for un-cached timings. The results will only be useful for order-of-magnitude estimates.
To give you some ideas on how fast the possible ingredients of your approach are, here are some micro-benchmarks:
```python
import hashlib
import functools


def md5(data):
    return hashlib.md5(data).digest()


@functools.lru_cache(maxsize=16384)
def sha1(data):
    return hashlib.sha1(data).digest()


def sha256(data):
    return hashlib.sha256(data).digest()


def sha512(data):
    return hashlib.sha512(data).digest()


import numpy as np
import numba as nb


@nb.jit(fastmath=True)
def hash_sum_nb(data, init=0):
    dtype = np.uint64
    nbytes = 8
    n = len(data)
    offset = n % nbytes
    result = init
    if offset:
        body = np.frombuffer(data[:-offset], dtype=dtype)
        tail = np.frombuffer(data[-offset:], dtype=np.uint8)
        for x in tail:
            result += x
    else:
        body = np.frombuffer(data, dtype=dtype)
    for x in body:
        result += x
    return result + n


import zlib
import string
import random


n = 1_000_000
s = b''.join(
    string.printable[random.randrange(len(string.printable))].encode()
    for _ in range(n))

funcs = hash, hash, zlib.crc32, zlib.adler32, md5, sha1, sha1, sha256, sha512, hash_sum_nb

for func in funcs:
    result = %timeit -n1 -r1 -q -o func(s)
    print(f'{func.__name__:>12s} {result.best * 1e6:.3f} µs')
#         hash 586.401 µs
#         hash 0.853 µs
#        crc32 976.128 µs
#      adler32 468.452 µs
#          md5 1790.659 µs
#         sha1 1362.948 µs
#         sha1 1.423 µs
#       sha256 1347.432 µs
#       sha512 1321.981 µs
#  hash_sum_nb 64.768 µs


cases = {
    'worst case': (s[:-1] + b'x', s[:-1] + b'y'),
    'best case*': (s[:-1], s[:-2]),
    'best case': (b'x' + s[:-1], b'y' + s[:-1]),
}

for name, (a, b) in cases.items():
    result = %timeit -n1 -r1 -q -o a == b
    print(f'str == str ({name:10s}) {result.best * 1e6:.3f} µs')
# str == str (worst case) 142.466 µs
# str == str (best case*) 0.856 µs
# str == str (best case ) 1.012 µs


a, b = (s[:-1] + b'x', s[:-1] + b'y')
result = %timeit -n1 -r1 -q -o is_k_equal(a, b)
print(f'{is_k_equal.__name__:>12s} {result.best * 1e6:.3f} µs')
#   is_k_equal 10.037 µs
```
Note that both hash() and sha1() are called twice on the same input to show the effects of memoization.
With this data (or similar numbers that you could produce on your input / system), it may be possible to craft a more performant string equality comparison.
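For example, one possible way to combine the ingredients above into a single function (a sketch under the assumptions discussed so far, not a drop-in replacement for ==; fast_equal is an illustrative name):

```python
def fast_equal(a, b, k=4096, hashing=hash):
    # Cheapest checks first: identity and length.
    if a is b:
        return True
    if len(a) != len(b):
        return False
    # Sparse sampling (as in is_k_equal): cheaply rejects inputs that differ
    # over long contiguous runs.
    if a[::k] != b[::k]:
        return False
    # Hash guard (as in is_equal_hashing): only worthwhile when the hashes are
    # memoized / reused across many comparisons, otherwise it is pure overhead.
    if hashing(a) != hashing(b):
        return False
    # Last resort: the full comparison.
    return a == b
```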
Note that throughout the answer I used bytes instead of str. The timings for str would generally be worse for most hashing functions, because of the additional overhead required to handle the encoding, with the notable exception of hash().
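For instance, feeding a str to a hashlib-based hash requires an explicit encoding pass first, whereas hash() accepts str directly (illustrative only):

```python
import hashlib

text = 'x' * 1_000_000

h1 = hash(text)                                   # works on str directly, no extra pass
h2 = hashlib.md5(text.encode('utf-8')).digest()   # the encode() adds a full pass over the data
```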