Murmur3 hash gives different results in Python and Java implementations

I have two different programs, one in Python and one in Java, that need to hash the same string using Murmur3.

Python version 2.7.9:

mmh3.hash128('abc')

Gives 79267961763742113019008347020647561319L.

Java, with Guava 18.0:

HashCode hashCode = Hashing.murmur3_128().newHasher().putString("abc", StandardCharsets.UTF_8).hash();

Gives the string "6778ad3f3f3f96b4522dca264174a23b"; converting it to a BigInteger gives 137537073056680613988840834069010096699.

How can I get the same result from both?

Thanks

Borchert answered 29/4, 2015 at 1:47 Comment(0)

Here's how to get the same result from both:

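byte[] mm3_le = Hashing.murmur3_128().hashString("abc", UTF_8).asBytes();
// Reverse the little-endian bytes so BigInteger can read them as big endian
byte[] mm3_be = Bytes.toArray(Lists.reverse(Bytes.asList(mm3_le)));
// Signum 1 keeps the value non-negative even when the top bit is set
assertEquals("79267961763742113019008347020647561319",
    new BigInteger(1, mm3_be).toString());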

The hash code's bytes need to be treated as little endian, but BigInteger interprets bytes as big endian. You were presumably using new BigInteger(hex, 16) to create the BigInteger, but the output of HashCode.toString() is actually a series of pairs of hexadecimal digits representing the hash bytes in the same order they're returned by asBytes() (little endian). You can also reverse those pairs of hexadecimal digits to get a hex number that does produce the same result when passed to new BigInteger(reversedHex, 16).
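To make the byte-order point concrete, here is a minimal Python sketch that uses only the standard library. It takes the hex string from the question and reads the same 16 bytes both ways; the variable names are illustrative and not part of either library.

```python
# Hex string printed by Guava's HashCode.toString() in the question.
java_hex = "6778ad3f3f3f96b4522dca264174a23b"
raw = bytes.fromhex(java_hex)  # the 16 hash bytes, in asBytes() order

# Reading the bytes as big endian reproduces the new BigInteger(hex, 16) value.
as_big_endian = int.from_bytes(raw, "big")

# Reading them as little endian reproduces the value from mmh3.hash128.
as_little_endian = int.from_bytes(raw, "little")

# Reversing the byte order gives the hex string suitable for
# new BigInteger(reversedHex, 16).
reversed_hex = raw[::-1].hex()

print(as_big_endian)
print(as_little_endian)
print(reversed_hex)
```

The two integers printed should match the Java and Python values quoted in the question, respectively.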

I think the documentation of toString() is somewhat confusing because of the way it refers to "big endian"; it doesn't actually mean that the output of the method is the hexadecimal number representing the bytes interpreted as big endian.

We have an open issue for adding asBigInteger() to HashCode.

Halcyon answered 29/4, 2015 at 16:46 Comment(0)

If anyone is interested in the reverse answer, converting the Python output to the Java output:

import mmh3

# hash_bytes returns the 16 raw hash bytes rather than an integer
murmur = mmh3.hash_bytes('abc')

# Format each byte as two lowercase hex digits, in byte order
result = [f'{byte:02x}' for byte in murmur]
print(''.join(result))
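
If only the integer from mmh3.hash128 is available, a similar conversion can be sketched with the standard library alone. This assumes the value is the unsigned 128-bit result (non-negative), as in the question; the variable names are illustrative.

```python
# Integer returned by mmh3.hash128('abc') in the question (assumed unsigned).
python_hash = 79267961763742113019008347020647561319

# Serialize as 16 little-endian bytes, then hex-encode them in byte order.
# to_bytes raises OverflowError for a negative value, so a signed result
# would need to be converted to unsigned first.
java_style_hex = python_hash.to_bytes(16, "little").hex()
print(java_style_hex)
```

The printed string should match the one Guava's HashCode.toString() produced in the question.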
Rampageous answered 15/4, 2018 at 8:55 Comment(0)
