Is there a pure python implementation of MurmurHash?
Asked Answered
R

5

26

I need (and can't find) a pure python (no c++) implementation of MurmurHash, and I'm too novice to write myself. Speed or memory usage doesn't matter on my project.

I find a attempt here, but it's limited to 31bits hash and I really need a 64bits hash.

Note : for those who need a fast implementation, there is a MurmurHash2 library here and a MurmurHash3 library here

Remediosremedy answered 9/11, 2012 at 9:28 Comment(4)
Why do you want a pure Python implementation?Fusil
I need a pure Python implementation because my application needs to run on platforms that can't execute non-python code (like Google App Engine).Remediosremedy
There is also this one, but I think the mmh3 one you found looks better cared for. code.google.com/p/pyfasthashGastritis
C/C++, not pure python... ("It provide several common hash algorithms with C/C++ implementation for performance")Remediosremedy
M
7

Another pure python implementation of MurmurHash3 that is totally compatible and replaceable by the mmh3 wrapper, but still limited to 32-bits Murmur3: https://github.com/wc-duck/pymmh3

Moo answered 27/2, 2015 at 14:27 Comment(0)
N
12

this is untested (sorry!), but here is a version I came up with. Python allows for arbitrarily large integers, so I created a mask for the first 8 bytes (or 64 bits) that I then apply (via bitwise AND) to all arithmetic results that could produce integers larger than 64bits. Maybe somebody else could comment on the general approach and possible issues with endianness, etc.

def bytes_to_long(bytes):
    assert len(bytes) == 8
    return sum((b << (k * 8) for k, b in enumerate(bytes)))


def murmur(data, seed):

    m = 0xc6a4a7935bd1e995
    r = 47

    MASK = 2 ** 64 - 1

    data_as_bytes = bytearray(data)

    h = seed ^ ((m * len(data_as_bytes)) & MASK)

    for ll in range(0, len(data_as_bytes), 8):
        k = bytes_to_long(data_as_bytes[ll:ll + 8])
        k = (k * m) & MASK
        k = k ^ ((k >> r) & MASK)
        k = (k * m) & MASK
        h = (h ^ k)
        h = (h * m) & MASK

    l = len(data_as_bytes) & 7

    if l >= 7:
        h = (h ^ (data_as_bytes[6] << 48))

    if l >= 6:
        h = (h ^ (data_as_bytes[5] << 40))

    if l >= 5:
        h = (h ^ (data_as_bytes[4] << 32))

    if l >= 4:
        h = (h ^ (data_as_bytes[3] << 24))

    if l >= 3:
        h = (h ^ (data_as_bytes[4] << 16))

    if l >= 2:
        h = (h ^ (data_as_bytes[4] << 8))

    if l >= 1:
        h = (h ^ data_as_bytes[4])
        h = (h * m) & MASK

    h = h ^ ((h >> r) & MASK)
    h = (h * m) & MASK
    h = h ^ ((h >> r) & MASK)

    return h
Nuthouse answered 2/4, 2013 at 1:50 Comment(0)
A
10

fix the bugs in Nikolas's version

def bytes_to_long(bytes):
    assert len(bytes) == 8
    return sum((b << (k * 8) for k, b in enumerate(bytes)))


def murmur64(data, seed = 19820125):

    m = 0xc6a4a7935bd1e995
    r = 47

    MASK = 2 ** 64 - 1

    data_as_bytes = bytearray(data)

    h = seed ^ ((m * len(data_as_bytes)) & MASK)

    off = len(data_as_bytes)/8*8
    for ll in range(0, off, 8):
        k = bytes_to_long(data_as_bytes[ll:ll + 8])
        k = (k * m) & MASK
        k = k ^ ((k >> r) & MASK)
        k = (k * m) & MASK
        h = (h ^ k)
        h = (h * m) & MASK

    l = len(data_as_bytes) & 7

    if l >= 7:
        h = (h ^ (data_as_bytes[off+6] << 48))

    if l >= 6:
        h = (h ^ (data_as_bytes[off+5] << 40))

    if l >= 5:
        h = (h ^ (data_as_bytes[off+4] << 32))

    if l >= 4:
        h = (h ^ (data_as_bytes[off+3] << 24))

    if l >= 3:
        h = (h ^ (data_as_bytes[off+2] << 16))

    if l >= 2:
        h = (h ^ (data_as_bytes[off+1] << 8))

    if l >= 1:
        h = (h ^ data_as_bytes[off])
        h = (h * m) & MASK

    h = h ^ ((h >> r) & MASK)
    h = (h * m) & MASK
    h = h ^ ((h >> r) & MASK)

    return h
Already answered 29/1, 2015 at 5:51 Comment(2)
For Python3 change off = len(data_as_bytes)/8*8 to off = int(len(data_as_bytes)/8)*8.Hancock
To compliment @zhuhongk's answer, Python version returns unsigned number, to return signed number like Java version, add following line in the end: ``` h = ctypes.c_long(h).value ```Brumaire
M
7

Another pure python implementation of MurmurHash3 that is totally compatible and replaceable by the mmh3 wrapper, but still limited to 32-bits Murmur3: https://github.com/wc-duck/pymmh3

Moo answered 27/2, 2015 at 14:27 Comment(0)
P
6

Here goes a pure Python implementation of MurmurHash3 32, it hashes strings only but can be easily adapted to take byte arrays instead. This is a port from the Java version of the algorithm.

def murmur3_x86_32(data, seed = 0):
    c1 = 0xcc9e2d51
    c2 = 0x1b873593

    length = len(data)
    h1 = seed
    roundedEnd = (length & 0xfffffffc)  # round down to 4 byte block
    for i in range(0, roundedEnd, 4):
      # little endian load order
      k1 = (ord(data[i]) & 0xff) | ((ord(data[i + 1]) & 0xff) << 8) | \
           ((ord(data[i + 2]) & 0xff) << 16) | (ord(data[i + 3]) << 24)
      k1 *= c1
      k1 = (k1 << 15) | ((k1 & 0xffffffff) >> 17) # ROTL32(k1,15)
      k1 *= c2

      h1 ^= k1
      h1 = (h1 << 13) | ((h1 & 0xffffffff) >> 19)  # ROTL32(h1,13)
      h1 = h1 * 5 + 0xe6546b64

    # tail
    k1 = 0

    val = length & 0x03
    if val == 3:
        k1 = (ord(data[roundedEnd + 2]) & 0xff) << 16
    # fallthrough
    if val in [2, 3]:
        k1 |= (ord(data[roundedEnd + 1]) & 0xff) << 8
    # fallthrough
    if val in [1, 2, 3]:
        k1 |= ord(data[roundedEnd]) & 0xff
        k1 *= c1
        k1 = (k1 << 15) | ((k1 & 0xffffffff) >> 17)  # ROTL32(k1,15)
        k1 *= c2
        h1 ^= k1

    # finalization
    h1 ^= length

    # fmix(h1)
    h1 ^= ((h1 & 0xffffffff) >> 16)
    h1 *= 0x85ebca6b
    h1 ^= ((h1 & 0xffffffff) >> 13)
    h1 *= 0xc2b2ae35
    h1 ^= ((h1 & 0xffffffff) >> 16)

    return h1 & 0xffffffff
Phonemic answered 30/5, 2014 at 14:53 Comment(1)
Caution: The algorithm is returning an unsigned python integer (>=0) while standard Java solutions return an 32bit signed integer (2 complement)Downtoearth
G
1

I propose to change the bytes_to_long function so that values with less than eight bytes are possible:

def bytes_to_long(bytes):
    length = len(bytes)
    if length < 8:
        extra = 8 - length
        bytes = b'\000' * extra + bytes
    assert len(bytes) == 8
   return sum((b << (k * 8) for k, b in enumerate(bytes)))

This fills the bytearray with Null bytes so that the assert stills works (you could remove it now) as intended but allows for the conversion of smaller values.

Greggs answered 22/5, 2018 at 10:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.