Order-invariant hash in Python

In Python, I would like to quickly compute an order-invariant hash for the lines of a file, as a way to identify its content "uniquely". These files are, for example, the output of a select ... from table, and thus the order of the lines is random.

Here is an example that achieves what I want (using one of the hashers in hashlib), but at the expense of having to sort the lines. Note that sorting the lines is just a way to achieve the goal, i.e. to get a hash that doesn't depend on the ordering of the lines in the file. But clearly, I'd like to avoid the O(n log n) cost, especially when the files are much longer.

import hashlib
import os

def get_hexdigest(filename, hasher, blocksize=65536, order_invariant=False):
    if not os.path.isfile(filename):
        return None
    if order_invariant:
        # Sort the decoded lines so the digest does not depend on line order.
        with open(filename, 'r') as f:
            for line in sorted(f):
                hasher.update(line.encode())
    else:
        # Order-sensitive: feed the hasher fixed-size binary chunks.
        with open(filename, 'rb') as f:
            while True:
                buf = f.read(blocksize)
                hasher.update(buf)
                if len(buf) < blocksize:
                    break
    return hasher.hexdigest()

So, for e.g. a 1 MB, 50K-row file:

%%time
get_hexdigest('some_file', hashlib.sha1())
# Wall time: 1.71 ms

But:

%%time
get_hexdigest('some_file', hashlib.sha1(), order_invariant=True)
# Wall time: 77.4 ms

What is a better/faster way to do that?

As noted in this answer, Scala has an order-invariant hash based on Murmurhash, but I assume it is the 32-bit version of mmh3 (too collision-prone for my usage), and I would rather use a library readily available in Python than implement something in C or in Cython. Murmurhash3 has a 128-bit version, but its output is different on x64 vs x86. I would like to have machine-independent results.

So, in summary, I would like:

  • consistent results across machine architectures
  • low collision rate, i.e. at least 128 bits with good dispersion (but I don't need the hash to be cryptographic)
  • reasonably fast, i.e. under 5 ms for a 1 MB, 50K-line file.
  • readily available, if possible, as a library on PyPi or Conda.
  • amenable to files with repeated lines (so just XORing per-line hashes is a non-starter, since any pair of identical lines would cancel each other out; see the quick sketch below).
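
To make the last point concrete, here is a quick sketch (the xor_digest helper is purely illustrative) showing how an XOR-based combiner lets a duplicated line cancel itself out:

import hashlib

def xor_digest(lines):
    # Purely illustrative helper: XOR the per-line SHA-1 digests together.
    # This is order-invariant, but flawed for files with repeated lines.
    acc = 0
    for line in lines:
        acc ^= int.from_bytes(hashlib.sha1(line).digest(), 'big')
    return acc

# The line b'c\n' appears twice and cancels itself, so the two "files" collide:
print(xor_digest([b'a\n', b'b\n']) == xor_digest([b'a\n', b'b\n', b'c\n', b'c\n']))  # True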

Edits and notes: Thanks to several comments, the code above is updated to sort the lines in memory. The original version, when order_invariant is True, was:

    with os.popen('sort {}'.format(filename)) as f:
        for line in f:
            hasher.update(line.encode(encoding='utf-8'))
    return hasher.hexdigest()

The associated wall time (for the file used above) was then 238 ms. This is now reduced to 77 ms, but still way slower than not sorting the lines. Sorting adds an O(n log n) cost for n lines.

Encoding (to UTF-8) and reading in mode 'r' rather than 'rb' are necessary when reading lines, as we then get strings, not bytes. I don't want to rely on the assumption that the files contain only ASCII data; reading in 'rb' could lead to lines not being split properly. I don't have the same concern when order_invariant is False, because then I don't have to split the file, so the fastest approach is to slurp chunks of binary data to update the hasher.

Toxoid answered 17/2, 2017 at 20:17 Comment(17)
First improvement: you could probably sort the lines in Python using for line in sorted(f) instead of calling an external process...Easton
As a side note, if you're working with lines (based on 'order-invariant hash for the lines of a file'), why are you using a buffer in the order-sensitive case? Just pump the hasher with .readline() output.Rinker
Also: it's probably much faster to encode the whole file and then split it into lines than to call .encode on every line: with open(...) as f: for line in sorted(f.read().encode('utf-8').split(b'\n')): hasher.update(line) (yes, it loads the whole file into memory, but sorting requires this).Kippy
Or even better, open the file in binary mode, so no decoding/reencoding will take placeCorum
@ThierryLathuille - his minimal unit for hashing is lines, so reading the whole file into a buffer is a moot point. Then again, he's opening the file in binary mode for some reason, too, pretty much guaranteeing that the above will not return the same hash for two otherwise identical files with shuffled line ordering - in the first case he's not even hashing it line by line.Rinker
@Pierre D - have you tried using xxHash ( pypi.python.org/pypi/xxhash )?Rinker
@Jean-FrançoisFabre thanks, you are right: in the case of sort, doing it in memory is much faster than calling the system's version. It is not always the case: for example, gzip compression or decompression is much faster when calling the Linux gzip (via os.popen('gzip -c file')) than when using the library.Toxoid
Maybe hash the frequency counts of the individual characters. Different files will generally have different counts, so collisions will be rare.Paternoster
Re. comments about UTF-8 encoding or not: the cost of encoding or not is pretty trivial and linear with file size. Sorting is not. I am trying to avoid the sorting, in memory or not. As to why I need to read the lines in 'r' mode, but then I can afford to just slurp binary data when no sorting is needed: it's just to guarantee a correct split by lines even if the file contains non-ASCII text. No such concern if we are not trying to split by lines.Toxoid
@JohnColeman that won't do: 'hello' would yield the same as 'olleh'. More to the point, since there are also a bunch of numbers in those files, '3.1415926' would give the same as '1.1234569'.Toxoid
@Rinker I didn't know about xxHash. Could you please elaborate how to use it for this problem?Toxoid
@PierreD It depends on the size of the files. You mentioned 50K lines. Having two such files with an identical frequency distribution is somewhat unlikely. Hash functions have collisions. Whether or not my suggestion is reasonable depends on the statistical profile of the files (something I am not in a position to ascertain).Paternoster
@JohnColeman I disagree: the result you suggest will be completely invariant of any ordering of all the characters of the file. The permutations that this reduces means that the collision rate will be high. Case in point, I found two collisions in 9534 different files with the simple: (for i in stash/data_*.tsv; do perl -0777 -e '$_=<>; for (sort (split(//))) {print}; print "\n"' $i | sha1sum -; done) > /tmp/out and wc -l /tmp/out vs sort -u /tmp/out | wc -l.Toxoid
@PierreD It was just a suggestion. The collision rate will depend on the nature of the files being hashed. If in your case the collision rate is too high, ignore the suggestion.Paternoster
@Rinker xxHash seems very good indeed. Extremely fast (13.8 GB/s for the 64 bit version on a 64 bit arch), and according to this, its quality is excellent with very low chance of collision. Since it can be seeded, I can update k of them in parallel with different seeds in order to achieve k*64 bits. For the order-invariant version, I can get the k digests for each line as so many ints (each 64 bits) and just keep k sums, ignoring the carry. Addition being commutative, the result will be order-invariant.Toxoid
@PierreD - that's what I was about to suggest. Since xxHash produces integers (via intdigest()) you can just sum the hashes of each line (e.g. for line in f: hash_sum += xxhash.xxh64(line).intdigest()) and then you can hash the resulting sum in the end to reduce it to 64 bits. I still doubt you'll reach your target 5ms, tho - it probably takes that much time just to go through the 50k lines file line by line, not counting the hashing overhead. You might want to look into numpy.fromfile() or pandas.read_csv() for very fast ways to get file contents into memory.Rinker
@Rinker BTW, after some testing with xxhash.xxh64, I find that while .update() is very fast over a large piece of data, the many calls to .reset(), .update() and .intdigest() make the whole thing slow. This is compelling me to look into numba's jit and code my own version of murmurhash (one that has an order-invariant function like the Scala code linked in the question).Toxoid
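
For reference, a minimal sketch of the per-line summation idea discussed in the last few comments, assuming the xxhash package from PyPI (the function name is illustrative):

import xxhash

def get_digest_xxh64_add(filename, seed=0):
    # Illustrative sketch using the xxhash package: sum the 64-bit xxHash of
    # each line. Addition is commutative, so the result does not depend on
    # the order of the lines.
    total = 0
    with open(filename, 'rb') as f:
        for line in f:
            total += xxhash.xxh64(line, seed=seed).intdigest()
    return '{:016x}'.format(total & 0xffffffffffffffff)

Running several of these with different seeds, as suggested above, widens the result to a multiple of 64 bits.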

I think you should sort the file beforehand (select ... from table order by ...) or come up with another solution to your actual problem.

Anyways, a possible approach in Python using a frozenset:

#!/usr/bin/python

lines1 = ['line1', 'line2', 'line3', 'line4']
lines2 = ['line2', 'line1', 'line3', 'line4']  # same as lines1 but different order
lines3 = ['line1', 'line1', 'line3', 'line4', 'line5']


for lines in [lines1, lines2, lines3]:
    print(lines)
    print(hash(frozenset(lines)))
    print('')

Output

['line1', 'line2', 'line3', 'line4']
8013284786872469720

['line2', 'line1', 'line3', 'line4']
8013284786872469720

['line1', 'line1', 'line3', 'line4', 'line5']
7430298023231386903

I doubt it will match your performance constraints. I don't know the time complexity (Big O) of building a frozenset(). It also effectively assumes lines are unique, since duplicates are dropped. Again, I highly suggest tackling the underlying problem differently.
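
For instance, because frozenset drops duplicates, a file in which a line is repeated gets the same hash as one where that line appears only once:

print(hash(frozenset(['line1', 'line1', 'line2'])) ==
      hash(frozenset(['line1', 'line2'])))  # True: the repeated line is collapsed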

Winther answered 22/2, 2017 at 12:24 Comment(4)
Frozenset is a good fit for this task. And since duplicate lines are eliminated while building the frozenset, there is no risk of two duplicate lines cancelling each other out.Clint
The downside is that the hash value is only 32 or 64 bits.Clint
Cool, +1 for thinking of the frozenset. However, I did some measurement for increasing sizes and found that the time taken by this approach is more than linear with the number of lines. That said, the overall constants are low. For example, it takes 15ms for 65536 lines, best I've seen so far. But then it takes 7.6s for 11.8M lines, not the best.Toxoid
BTW, the files come as is from a different system over which I have little or no control; there are many of them, and I cannot affect the way they are produced. I wish they were always sorted. Often they are not.Toxoid

How about this Merkle-style map-reduce: hash each line, optionally sort the per-line hashes (for order invariance), then hash their concatenation:

import hashlib

def hasher(data):
    # Hash a single string (e.g. one line) with SHA-1 and return its hex digest.
    h = hashlib.sha1()
    h.update(data.encode('utf-8'))
    return h.hexdigest()


def get_digest_by_line(filename, line_invariant=False, hasher=hasher):
    with open(filename, 'r') as f:
        # Map step: hash each line individually.
        hashes = (hasher(line) for line in f)
        if line_invariant:
            # Sorting the per-line hashes makes the result order-invariant.
            hashes = sorted(hashes)
        # Reduce step: hash the concatenation of the per-line hashes.
        return hasher(''.join(hashes))
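
A quick way to sanity-check the invariance, using two illustrative throwaway files containing the same lines in different order:

# 'a.txt' and 'b.txt' are illustrative file names.
with open('a.txt', 'w') as f:
    f.write('x\ny\nz\n')
with open('b.txt', 'w') as f:
    f.write('z\nx\ny\n')

print(get_digest_by_line('a.txt', line_invariant=True) ==
      get_digest_by_line('b.txt', line_invariant=True))  # True
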
Siler answered 20/7, 2017 at 12:10 Comment(1)
Hey, good idea. In fact, with some minor mods it could also be distributed for very large files, e.g. in a map-reduce scheme with some binning on hash prefix.Toxoid

Thank you all for the interesting comments and answers so far.

At this time, the best answer for large files (>350K lines) is (a) below. It is based on Murmurhash3, adding up the mmh3.hash128() of each line. For smaller files, it is (b) below: a variant of the frozenset approach proposed by Rolf, which I adapted to produce a 128-bit hash (although I wouldn't vouch for the quality of those 128 bits).

a) mmh3.hash128() for each line and add

import mmh3

def get_digest_mmh3_128_add(filename):
    a = 0
    with open(filename, 'rb') as f:
        for line in f:
            # Addition is commutative, so the running sum is order-invariant.
            a += mmh3.hash128(line)
    return '{:032x}'.format(a & 0xffffffffffffffffffffffffffffffff)

In my setting: constant 0.4 second per million lines.

b) two frozenset hashes

def get_digest_hash_frozenset128(filename):
    with open(filename, 'rb') as f:
        frz = frozenset(f.readlines())
    # Combine two 64-bit hashes into 128 bits: the frozenset itself, and the
    # frozenset augmented with a sentinel line unlikely to appear in any file.
    combined = (hash(frz) << 64) + hash(frz.union({b'not a line'}))
    return '{:032x}'.format(combined & 0xffffffffffffffffffffffffffffffff)

In my setting: between 0.2 and 0.6 second per million lines.

Notes

  1. After consideration, I decided it was OK to read the lines of the file in binary mode, even if they potentially contain UTF-8 text. The risk is that, if some multi-byte character contained a '\n' byte, a line would be accidentally split at that point. The file would then get the same digest as another one in which the two parts of that line were arranged differently (or even split apart and placed elsewhere in the file), but the probability of this is extremely low and I can live with it.

  2. Adding all the 128-bit hashes in (a) is done using Python's arbitrary-precision ints. At first, I tried to keep the sum within 128 bits (by repeatedly and-ing with the 0xfff...fff constant), but that turns out to be slower than letting Python use arbitrary precision and doing the masking once at the end.

  3. I am trying to get 128 bits from the regular hash of a frozenset by taking two hashes: that of the frozenset itself, and that of the frozenset augmented with a line that is unlikely to appear in any file (kind of the same as using a different seed for the hash, I guess).

Complete results

A full notebook is available here. It creates pseudo-random files of arbitrary sizes and tries several digest approaches while measuring the time taken by each of them. This is run on an EC2 instance (r3.4xlarge, using an EBS volume to store the pseudo-random files) in a Jupyter (IPython) notebook with Python 3.6.

For 46341 lines, we get

fun                              lines millis
get_digest_xxh64_order_sensitive 46341    0.4 *
get_digest_sha1                  46341    1.7 *
get_digest_hash_frozenset64      46341    8.7
get_digest_hash_frozenset128     46341   10.8
get_digest_sha1_by_lines         46341   14.1 *
get_digest_mmh3_128_add_cy       46341   18.6
get_digest_mmh3_128_add          46341   19.7
get_digest_sha1_sort_binary      46341   44.3
get_digest_sha1_sort             46341   65.9

*: These are order-dependent, just here for comparison.

get_digest_hash_frozenset64 isn't really suitable as it gives only 64 bits.

get_digest_mmh3_128_add_cy is a cythonized version of the function given above in (a), but there is little difference.

get_digest_xxh64_order_sensitive is extremely fast, but it is order dependent. My attempts (not listed here) to derive an order-invariant version all gave some pretty slow results. The reason, I think, is the apparently high cost of initializing and finalizing the hash.

For larger files, the get_digest_mmh3_128_add_cy wins. Here is for 11.8M lines:

fun                                 lines    millis
get_digest_xxh64_order_sensitive 11863283      97.8 *
get_digest_sha1                  11863283     429.3 *
get_digest_sha1_by_lines         11863283    3453.0 *
get_digest_mmh3_128_add_cy       11863283    4692.8
get_digest_mmh3_128_add          11863283    4956.6
get_digest_hash_frozenset64      11863283    6418.2
get_digest_hash_frozenset128     11863283    7663.6
get_digest_sha1_sort_binary      11863283   27851.3
get_digest_sha1_sort             11863283   34806.4

Focusing on the two leading order-invariant contenders, here is how much time they take as a function of size (number of lines). The y-axis is microseconds per line and the x-axis is the number of lines in the file. Note how get_digest_mmh3_128_add_cy spends a constant time (0.4 us) per line.

[Plot: time per line of the two order-invariant digests as a function of the number of lines]

Next steps

Sorry for the long-winded answer. This is only an interim answer, as I might (time permitting) later experiment with a direct implementation of Murmurhash3 in numba, Cython, or C++.

Toxoid answered 25/2, 2017 at 7:1 Comment(0)

In case someone is interested in a concise (and unbenchmarked) solution:

import hashlib

def hash(data):
    return hashlib.md5(data).digest()

# Order-invariant (since addition is commutative) hash of strings, byte by byte.
def hash_up(*args):
    hashes = [hash(d.encode()) for d in args]
    return ''.join(format(sum(t) % 256, '02x') for t in zip(*hashes))
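
For example, the digest is the same regardless of the order of the arguments:

print(hash_up('line1', 'line2') == hash_up('line2', 'line1'))  # True
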
Pollard answered 19/3 at 10:24 Comment(0)
