I need to create an identifier token from a set of nested configuration values.
The token can be part of a URL, so – to make processing easier – it should contain only hexadecimal digits (or something similar).
The config values are nested tuples with elements of hashable types like int, bool, str, etc.
My idea was to use the built-in hash() function, as this will continue to work even if the config structure changes.
This is my first attempt:
    def token(config):
        h = hash(config)
        return '{:X}'.format(h)
This will produce tokens of variable length, but that doesn't matter.
What bothers me, though, is that the token might contain a leading minus sign, since the return value of hash() is a signed integer.
To avoid the sign, I thought of the following workaround: adding a constant to the hash value.
This constant should be half the size of the range the value of hash() can take (which is platform-dependent, e.g. different for 32-/64-bit systems):
    import sys
    HALF_HASH_RANGE = 2**(sys.hash_info.width-1)
Is this a sane and portable solution? Or will I shoot myself in the foot with this?
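For concreteness, the workaround described above could look roughly like the sketch below. It assumes CPython, where hash() returns a signed integer that fits in sys.hash_info.width bits; the sample config tuple is made up for illustration.

```python
import sys

# Half of the hash range: 2**63 on a typical 64-bit build, 2**31 on a 32-bit one.
HALF_HASH_RANGE = 2 ** (sys.hash_info.width - 1)

def token(config):
    # Assumption: hash() lies in [-HALF_HASH_RANGE, HALF_HASH_RANGE),
    # so adding the offset shifts it into [0, 2 * HALF_HASH_RANGE).
    h = hash(config) + HALF_HASH_RANGE
    return '{:X}'.format(h)

# Hypothetical nested config, only for demonstration.
print(token((1, True, ('a', 2))))
```

The result is still of variable length, but it can no longer start with a minus sign.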
I also saw suggestions for using struct.pack() (which returns a bytes object, on which one can call the .hex() method), but it also requires knowing the range of the hash value in advance (for the choice of the right format character).
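A related option that avoids picking a struct format character is int.to_bytes(), which takes the byte count as a plain argument. This is a different technique from struct.pack(), shown here only as a sketch; the function name token_bytes is made up, and the byte count is assumed to be sys.hash_info.width // 8.

```python
import sys

def token_bytes(config):
    # Number of bytes in a hash value: 8 on 64-bit builds, 4 on 32-bit ones.
    n_bytes = sys.hash_info.width // 8
    # signed=True encodes negative values in two's complement,
    # so the hex string never contains a minus sign.
    raw = hash(config).to_bytes(n_bytes, byteorder='big', signed=True)
    return raw.hex()

# Hypothetical nested config, only for demonstration.
print(token_bytes((1, True, ('a', 2))))
```

Unlike the '{:X}' formatting, this always yields a token of fixed length (two hex digits per byte).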
Addendum:
Encryption strength or collisions by chance are not an issue.
The drawback of the hashlib library in this scenario is that it requires writing a converter that traverses the input structure and converts everything into a bytes representation, which is cumbersome.
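One possible shortcut around the converter, along the lines of the repr() suggestion in the comments, is to feed the UTF-8 encoded repr() of the whole structure to hashlib. The sketch below assumes that every element has a deterministic repr() (true for int, bool, str and tuples of them); the name stable_token and the choice of SHA-256 are illustrative, not prescribed by anything above.

```python
import hashlib

def stable_token(config):
    # repr() serialises the nested tuples into one string, e.g. "(1, True, ('a', 2))".
    text = repr(config)
    # hexdigest() returns lowercase hexadecimal digits only, with no sign.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Hypothetical nested config, only for demonstration.
print(stable_token((1, True, ('a', 2))))
```

Because this does not rely on hash(), the token should stay the same across program runs, which is not the case for string hashes when hash randomization is active.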
mask = (1<<sys.hash_info.width) - 1; h = hash(config) & mask. – Sharpsighted
[i & 0xf for i in range(-8, 8)]. FWIW, this is a fairly standard Python idiom for converting signed integers to unsigned. – Sharpsighted
hash is very fast, and the built-in dict and set types use hash, but it's far less resistant to collisions than the cryptographic functions in hashlib, which produce much larger hashes. – Sharpsighted
__repr__ methods: just encode the string returned by repr() to UTF-8. – Sharpsighted
repr() on the whole structure for serialising! Why hadn't I thought of that... – Albata
1<<n is significantly faster than 2**n, although many Python coders consider the latter to be more readable, unless you're already doing other bitwise stuff. And of course 1<<n will raise ValueError if n is negative. – Sharpsighted
hash() - it's not guaranteed to be calculated the same way in all Python versions, and at some point string hashes started being intentionally randomized on each program run. – Julieannjulien
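The masking idiom from the comments can be put together as the following sketch. It assumes CPython, where hash() fits in sys.hash_info.width bits; the name token_masked is made up for illustration.

```python
import sys

# All-ones bit mask as wide as a hash value, e.g. 0xFFFFFFFFFFFFFFFF on 64-bit builds.
MASK = (1 << sys.hash_info.width) - 1

def token_masked(config):
    # ANDing with the mask maps a negative hash to its two's complement
    # representation, so the result is always non-negative.
    h = hash(config) & MASK
    return '{:X}'.format(h)

# The same trick on a 4-bit scale: -8..-1 map to 8..15, 0..7 stay unchanged.
print([i & 0xf for i in range(-8, 8)])
# -> [8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7]
```

As the last comment points out, tokens built on hash() are only comparable within a single program run if the config contains strings and hash randomization is active.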