In Python, I would like to quickly compute an order-invariant hash for the lines of a file, as a way to "uniquely" identify its content. These files are, for example, the output of a select ... from table,
and thus the order of the lines is random.
Here is an example that achieves what I want (using one of the hashers in hashlib), but at the expense of having to sort the lines. Note that sorting the lines is just a means to the goal, i.e. getting a hash that doesn't depend on the ordering of the lines in the file. But clearly, I'd like to avoid the O(n*log(n)) cost, especially when the files are much longer.
import os
import hashlib

def get_hexdigest(filename, hasher, blocksize=65536, order_invariant=False):
    if not os.path.isfile(filename):
        return None
    if order_invariant:
        # Sort the decoded lines in memory so the digest does not depend on their order.
        with open(filename, 'r') as f:
            for line in sorted(f):
                hasher.update(line.encode())
    else:
        # Fast path: feed raw binary chunks straight to the hasher.
        with open(filename, 'rb') as f:
            while True:
                buf = f.read(blocksize)
                hasher.update(buf)
                if len(buf) < blocksize:
                    break
    return hasher.hexdigest()
So, for e.g. a 1 MB, 50K-row file:
%%time
get_hexdigest('some_file', hashlib.sha1())
# Wall time: 1.71 ms
But:
%%time
get_hexdigest('some_file', hashlib.sha1(), order_invariant=True)
# Wall time: 77.4 ms
What is a better/faster way to do that?
As noted in this answer, Scala has an order-invariant hash based on MurmurHash, but I assume it is the 32-bit version of mmh3 (too collision-prone for my usage), and also I would rather use a standard library available in Python than implement something in C or in Cython. MurmurHash3 has a 128-bit version, but its output is different on x64 vs x86. I would like to have machine-independent results.
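As an aside, and only as a hedged sketch rather than something I have verified: if I read the mmh3 binding correctly, its hash128 takes an x64arch flag, so always pinning one variant should give the same 128-bit value on any machine, since the variant itself is deterministic and only choosing it by host architecture makes results differ:

    import mmh3

    # Assumption: pinning x64arch=True always selects the x64_128 variant,
    # independent of the architecture the code actually runs on.
    h = mmh3.hash128(b'some line\n', seed=0, x64arch=True)  # 128-bit unsigned int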
So, in summary, I would like:
- consistent results across machine architectures
- low collision rate, i.e. at least 128 bits with good dispersion (but I don't need the hash to be cryptographic)
- reasonably fast, say under 5 ms for a 1 MB, 50K-line file.
- readily available, if possible, as a library on PyPI or Conda.
- amenable to files with repeated lines (so just XORing per-line hashes is a non-starter, as any pair of identical lines would cancel each other; see the summation sketch after this list).
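For illustration, here is a minimal, untested sketch of the "sum the per-line hashes" idea that comes up in the comments below: each line gets a 128-bit digest (blake2b truncated to 16 bytes, my own choice), and the digests are summed modulo 2**128. Addition is commutative, so the result is order-invariant, and unlike XOR, repeated lines accumulate instead of cancelling. The function name is mine and I haven't timed it against the 5 ms target:

    import hashlib

    def order_invariant_hexdigest(filename, digest_size=16):
        # Sum the per-line digests modulo 2**(8*digest_size): addition is
        # commutative (order-invariant), and duplicate lines do not cancel.
        mod = 1 << (8 * digest_size)
        total = 0
        with open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                d = hashlib.blake2b(line.encode('utf-8'), digest_size=digest_size).digest()
                total = (total + int.from_bytes(d, 'little')) % mod
        return total.to_bytes(digest_size, 'little').hex()

This is not a cryptographic construction (an adversary could craft colliding sets of lines), but the requirements above explicitly don't ask for one.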
Edits and notes:
Thanks to several comments, the code above has been updated to sort the lines in memory. The original version for order_invariant=True was:
    with os.popen('sort {}'.format(filename)) as f:
        for line in f:
            hasher.update(line.encode(encoding='utf-8'))
    return hasher.hexdigest()
The associated wall time (for the file used above) was then 238 ms. This is now down to 77 ms, but still way slower than not sorting the lines, since sorting adds an O(n*log(n)) cost for n lines.
The encoding (to UTF-8) and reading in mode 'r' rather than 'rb' are necessary when reading lines, as we then get strings rather than bytes. I don't want to rely on the assumption that the files contain only ASCII data, and reading in 'rb' could lead to lines not being split properly. I don't have the same concern when order_invariant is False, because then I don't have to split the file at all, so the fastest way is to slurp chunks of binary data into the hasher.
Comments:
- …sort the python lines using for lines in sorted(f) instead of calling an external process... – Easton
- …hasher with .readline() output. – Rinker
- ….encode on every line: with open(...) as f: for line in sorted(f.read().encode('utf-8').split('\n')): hasher.update(line) (yes, it loads the whole file into memory, but sorting requires this). – Kippy
- …xxHash (pypi.python.org/pypi/xxhash)? – Rinker
- …sort: doing it in memory is much faster than calling the system's version. It is not always the case: for example, doing gzip() compression or decompression is much faster calling the linux gzip (via os.popen('gzip -c file')) than using the library. – Toxoid
- …'r' mode, but then I can afford to just slurp binary data when no sorting is needed: it's just to guarantee a correct split by lines even if the file contains non-ASCII text. No such concern if we are not trying to split by lines. – Toxoid
- …xxHash. Could you please elaborate how to use it for this problem? – Toxoid
- …(for i in stash/data_*.tsv; do perl -0777 -e '$_=<>; for (sort (split(//))) {print}; print "\n"' $i | sha1sum -; done) > /tmp/out and wc -l /tmp/out vs sort -u /tmp/out | wc -l. – Toxoid
- …intdigest()) you can just sum the hashes of each line (e.g. for line in f: hash_sum += xxhash.xxh64(line).intdigest()) and then you can hash the resulting sum in the end to reduce it to 64 bits. I still doubt you'll reach your target 5 ms, though - it probably takes that much time just to go through the 50K-line file line by line, not counting the hashing overhead. You might want to look into numpy.fromfile() or pandas.read_csv() for very fast ways to get file contents into memory. – Rinker (a sketch of this summation appears below)
- …xxhash.xxh64, I find that while .update() is very fast over a large piece of data, the many calls to .reset(), .update() and .intdigest() make the whole thing slow. This is compelling me to look into numba's jit and code my own version of murmurhash (one that has an order-invariant function like the Scala code linked in the question). – Toxoid
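Following Rinker's suggestion above, here is a hedged sketch of the summation with xxhash, using the one-shot xxhash.xxh64_intdigest helper which, as I understand the python-xxhash API, avoids creating and resetting a hasher object per line. Note it produces only 64 bits, short of the 128-bit requirement, so it only illustrates the mechanics, and I have not timed it:

    import xxhash

    def sum_xxh64_hexdigest(filename, seed=0):
        # Sum the one-shot 64-bit digests of each line modulo 2**64;
        # order-invariant, and repeated lines accumulate rather than cancel.
        total = 0
        with open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                total = (total + xxhash.xxh64_intdigest(line.encode('utf-8'), seed=seed)) % (1 << 64)
        return '{:016x}'.format(total)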