I need to compare large chunks of data for equality, and I need to compare many pairs per second, so it has to be fast. Each object is guaranteed to be the same length; it is likely that a pair differs only slightly, at unknown positions.
The timings below show that the == operator is very fast if there is a difference near the start of the data, and significantly slower if the differences are located towards the end.
>>> import os
>>> s = os.urandom(1600*1200 - 1)
>>> Aimg = b"A" + s
>>> Bimg = b"B" + s
>>> img1 = s + b"1"
>>> img2 = s + b"2"
>>> %timeit Aimg == Bimg
61.8 ns ± 0.484 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit img1 == img2
159 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
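For context, a pair that differs only in the middle can be built like this (hypothetical names, not timed above); since == can stop at the first differing byte, comparing such a pair should cost roughly half of the worst case:

# single differing byte at the midpoint, identical before and after it
n = 1600 * 1200
mid = n // 2
Mimg1 = s[:mid] + b"1" + s[mid:]
Mimg2 = s[:mid] + b"2" + s[mid:]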
In my use case, differences may be located towards the middle or the end of the bytes (context: it is uncompressed image data). I looked for a way to speed things up using a hash or checksum. Using md5 was slower, but Python's built-in hash did actually speed things up.
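For reference, the md5 attempt looked roughly like this (md5_equal is a hypothetical name for illustration; computing both digests reads every byte of both buffers, so it cannot beat a plain == for a single comparison):

import hashlib

def md5_equal(a: bytes, b: bytes) -> bool:
    # hashing both buffers touches every byte, which is why this was
    # slower than a direct == in my timings
    return hashlib.md5(a).digest() == hashlib.md5(b).digest()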
>>> %timeit img1 == img2
160 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit hash(img1) == hash(img2) and img1 == img2
236 ns ± 5.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
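The first hash call is not free, since it has to read the whole buffer; my understanding (an assumption about CPython internals that I have not verified) is that a bytes object caches its hash after that first call, which would explain why the repeated comparison above measures in nanoseconds. A hypothetical sketch of paying that cost up front when each image is produced:

def with_hash(img: bytes):
    # pay the one-off hashing cost when the image is loaded/produced
    return (hash(img), img)

def fast_equal(a, b):
    # a and b are (hash, bytes) pairs from with_hash(); integer compare first,
    # full byte comparison only when the hashes match
    return a[0] == b[0] and a[1] == b[1]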
I'm interested in the technical details of this hash: is it sufficiently hash-like that when hash(a) == hash(b), then a == b is very likely? False positives are acceptable as long as hash collisions are reasonably rare; the intention is a fast path to speed up comparisons in the average case.
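To make the intended fast path concrete, this is essentially the helper I have in mind (equal objects are guaranteed to have equal hashes, so there are no false negatives; keeping the full comparison as a fallback means a hash collision only costs the slow path rather than giving a wrong answer):

def images_equal(a: bytes, b: bytes) -> bool:
    # cheap integer comparison first; fall back to the full byte-by-byte
    # comparison only when the hashes match
    return hash(a) == hash(b) and a == b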
Note: I'm aware that the built-in hash function is architecture dependent, so its values are only comparable within the same platform and process; that is acceptable in my situation, since all comparisons happen within a single run.