Positive integer from Python hash() function

L

5

36

I want to use the Python hash() function to get integer hashes from objects. But built-in hash() can give negative values, and I want only positive. And I want it to work sensibly on both 32-bit and 64-bit platforms.

I.e. on 32-bit Python, hash() can return an integer in the range -2**31 to 2**31 - 1. On 64-bit systems, hash() can return an integer in the range -2**63 to 2**63 - 1.

But I want a hash in the range 0 to 2**32-1 on 32-bit systems, and 0 to 2**64-1 on 64-bit systems.

What is the best way to convert the hash value to its equivalent positive value within the range of the 32- or 64-bit target platform?

(Context: I'm trying to make a new random.Random style class. According to the random.Random.seed() docs, the seed "optional argument x can be any hashable object." So I'd like to duplicate that functionality, except that my seed algorithm can't handle negative integer values, only positive.)

Landes answered 12/9, 2013 at 14:7 Comment(0)

T

2

To support platforms with signed and unsigned hash(), you could use

hash(x) % 2**sys.hash_info.width

This will use the actual hash width as reported by Python rather than a guess from what Python deems to be the maximum size of a list on the platform. The % operation will map negative numbers to positive numbers, e.g. -1 will become 2**sys.hash_info.width - 1.

Note that if x is a positive integer close to 0, hash(x) is the identity function, i.e. it just passes through the value. In general, on 64-bit with Python 3.6, it seems to compute


(abs(x) % m) * (-1 if x<0 else 1)

with m=2**61-1, the 9th Mersenne prime. This can be problematic in some applications, e.g. filling a hash table with 1000 entries with values x = int(time_YYYMMDDhhmm) would result into a higher density of items at indices 0 to 359 than for 400 to 959 (and zero density in any range k*100+60 to k*100 + 99).

Thyrotoxicosis answered 27/3, 2024 at 14:43 Comment(3)

That sounds good. sys.hash_info was new in Python 3.2. At the time this question was originally asked in 2013, we had to consider Python versions where this didn't exist. But now in 2024 it's pretty safe to rely on it. – Landes 28/3, 2024 at 0:59

What is the difference between sys.hash_info.width and sys.hash_info.hash_bits? It sounds as though potentially, the hash output might use a truncation of the internal hash algorithm's output, so sys.hash_info.width ≤ sys.hash_info.hash_bits, and we should use sys.hash_info.width. – Landes 28/3, 2024 at 1:3

@CraigMcQueen Both the C-API and PEP 456 say that hash_bits is internal. It probably is at least width as you say but there is no guarantee. I'd say if these details are important in an application, write your own hash function. – Thyrotoxicosis 4/4, 2024 at 11:41

C

32

Using sys.maxsize:

>>> import sys
>>> sys.maxsize
9223372036854775807L
>>> hash('asdf')
-618826466
>>> hash('asdf') % ((sys.maxsize + 1) * 2)
18446744073090725150L

Alternative using ctypes.c_size_t:

>>> import ctypes
>>> ctypes.c_size_t(hash('asdf')).value
18446744073090725150L

Chante answered 12/9, 2013 at 14:14 Comment(5)

There's no reason whatsoever to use a modulus here. I mean sure it works but it's less efficient and harder to read. – Te 12/9, 2013 at 14:24

@CraigMcQueen, I added an alternative way. Check it out. – Chante 12/9, 2013 at 14:41

It's dangerous to assume that sys.maxsize will always be 2**31-1 on 32-bit and 2**63-1 on 64-bit platforms. Check for the expected values with if and elif and raise an exception if you encounter an unexpected value. – Thyrotoxicosis 27/3, 2024 at 14:32

Note that hash(x) is limited to the 9th Mersenne prime 2**61-1 for integer input on 64 bit. Probably some prime smaller than 2**32 is used on 32 bit. This is all undocumented and should not be relied on. – Thyrotoxicosis 27/3, 2024 at 14:35

Use sys.hash_info.width (see my answer) to use the actual hash width rather than assuming it always coincides with the platform word width. – Thyrotoxicosis 27/3, 2024 at 14:45

T

10

Just using sys.maxsize is wrong for obvious reasons (it being `2*n-1 and not 2*n), but the fix is easy enough:

h = hash(obj)
h += sys.maxsize + 1

for performance reasons you may want to split the sys.maxsize + 1 into two separate assignments to avoid creating a long integer temporarily for most negative numbers. Although I doubt this is going to matter much

Te answered 12/9, 2013 at 14:21 Comment(4)

You code could produce duplicated value. Try -0x7fffffffffffffff + sys.maxsize + 1 in 64bit system. – Chante 12/9, 2013 at 14:35

Ah yes true enough, shouldn't be conditional. Where's my head today? – Te 12/9, 2013 at 14:38

Why /2 instead of *2? – Chante 12/9, 2013 at 14:40

Actually it's neither / nor * 2. We have a range of [-2**31, 2**31-1], but we want [-2**31+2**31, 2**31-1+2**31] (example is for 32bit system). So it's just an addition by the lower boundary (2**31).. really confused today I am. – Te 12/9, 2013 at 14:41

H

3

(Edit: at first I thought you always wanted a 32-bit value)

Simply AND it with a mask of the desired size. Generally sys.maxsize will already be such a mask, since it's a power of 2 minus 1.

import sys
assert (sys.maxsize & (sys.maxsize+1)) == 0 # checks that maxsize+1 is a power of 2 

new_hash = hash & sys.maxsize

Hinrichs answered 12/9, 2013 at 14:28 Comment(3)

That's a good idea, although I want a 64-bit (or 32-bit) value, not a 63-bit (or 31-bit) value. – Landes 13/9, 2013 at 1:24

@CraigMcQueen, sorry I thought that maxsize was already the right size. – Hinrichs 13/9, 2013 at 2:5

Like that you check the assumption that sys.maxsize is 2**k-1 – Thyrotoxicosis 27/3, 2024 at 14:42

T

2

To support platforms with signed and unsigned hash(), you could use

hash(x) % 2**sys.hash_info.width

This will use the actual hash width as reported by Python rather than a guess from what Python deems to be the maximum size of a list on the platform. The % operation will map negative numbers to positive numbers, e.g. -1 will become 2**sys.hash_info.width - 1.

Note that if x is a positive integer close to 0, hash(x) is the identity function, i.e. it just passes through the value. In general, on 64-bit with Python 3.6, it seems to compute


(abs(x) % m) * (-1 if x<0 else 1)

with m=2**61-1, the 9th Mersenne prime. This can be problematic in some applications, e.g. filling a hash table with 1000 entries with values x = int(time_YYYMMDDhhmm) would result into a higher density of items at indices 0 to 359 than for 400 to 959 (and zero density in any range k*100+60 to k*100 + 99).

Thyrotoxicosis answered 27/3, 2024 at 14:43 Comment(3)

That sounds good. sys.hash_info was new in Python 3.2. At the time this question was originally asked in 2013, we had to consider Python versions where this didn't exist. But now in 2024 it's pretty safe to rely on it. – Landes 28/3, 2024 at 0:59

What is the difference between sys.hash_info.width and sys.hash_info.hash_bits? It sounds as though potentially, the hash output might use a truncation of the internal hash algorithm's output, so sys.hash_info.width ≤ sys.hash_info.hash_bits, and we should use sys.hash_info.width. – Landes 28/3, 2024 at 1:3

@CraigMcQueen Both the C-API and PEP 456 say that hash_bits is internal. It probably is at least width as you say but there is no guarantee. I'd say if these details are important in an application, write your own hash function. – Thyrotoxicosis 4/4, 2024 at 11:41

S

1

How about:

h = hash(o)
if h < 0:
  h += sys.maxsize

This uses sys.maxsize to be portable between 32- and 64-bit systems.

Submariner answered 12/9, 2013 at 14:14 Comment(1)

You could still end up with a value of -1. Other than that corner-case, the resulting values are in the range of a positive 31-bit number (not 32-bit). – Landes 28/3, 2024 at 1:32

Recommended topics

Hot tags