Hex digest for Python's built-in hash function
Asked Answered
A

3

9

I need to create an identifier token from a set of nested configuration values. The token can be part of a URL, so – to make processing easier – it should contain only hexadecimal digits (or something similar). The config values are nested tuples with elements of hashable types like int, bool, str etc.

My idea was to use the built-in hash() function, as this will continue to work even if the config structure changes. This is my first attempt:

def token(config):
    h = hash(config)
    return '{:X}'.format(h)

This will produce tokens of variable length, but that doesn't matter. What bothers me, though, is that the token might contain a leading minus sign, since the return value of hash() is a signed integer.

As a way to avoid the sign, I thought of the following work-around, which is adding a constant to the hash value. This constant should be half the size of the range the value of hash() can take (which is platform-dependent, eg. different for 32-/64-bit systems):

HALF_HASH_RANGE = 2**(sys.hash_info.width-1)

Is this a sane and portable solution? Or will I shoot myself in the foot with this?

I also saw suggestions for using struct.pack() (which returns a bytes object, on which one can call the .hex() method), but it also requires knowing the range of the hash value in advance (for the choice of the right format character).

Addendum:
Encryption strength or collisions by chance are not an issue. The drawback of the hashlib library in this scenario is that it requires writing a converter that traverses the input structure and converts everything into a bytes representation, which is cumbersome.

Albata answered 14/8, 2017 at 12:28 Comment(10)
I'd be inclined to do mask = (1<<sys.hash_info.width) - 1 h = hash(config) & mask.Sharpsighted
Ooh, clever. Might not be the mostest Pythonic-est ways of all...Albata
Here's a demo of the principle using small integers: [i & 0xf for i in range(-8, 8)]. FWIW, this is a fairly standard Python idiom for converting signed integers to unsigned.Sharpsighted
Ok, thanks. Well, you're proabably right – why would Python have bit-wise operators if you shouldn't use them.Albata
Maybe you should add more info to your question. How important is it to avoid collisions? The built-in hash is very fast, and the built-in dict and set types use hash, but it's far less resistant to collisions than the cryptographic functions in hashlib which produce much larger hashes.Sharpsighted
BTW, there's a uuid module for creating unique IDs. Yes, it requires bytes input, but that's easy enough to provide if your objects have decent __repr__ methods: just encode the string returned by repr() to UTF-8.Sharpsighted
Now that's a brilliant idea – of course, I can just call repr() on the whole structure for serialising! Why hadn't I thought of that...Albata
As for Python's bitwise operators: 1<<n is significantly faster than 2**n, although many Python coders consider the latter to be more readable, unless you're already doing other bitwise stuff. And of course 1<<n will raise ValueError if n is negative.Sharpsighted
Are these hash values intended to be used beyond a single run of your program? If so, you CANNOT use the built-in hash() - it's not guaranteed to be calculated the same way in all Python versions, and at some point string hashes started being intentionally being randomized on each program run.Julieannjulien
Thanks, @Julieannjulien – they're not used across multiple runs. I knew somebody would bring it up, but I didn't want to put too much information into the post.Albata
R
4

You can use any of hash functions for getting unique string. Right now python support out of the box many algorithms, like: md5, sha1, sha224, sha256, sha384, sha512. You can read more about it here - https://docs.python.org/2/library/hashlib.html

This example shows how to use library hashlib. (Python 3)

>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update('somestring'.encode())
>>> sha.hexdigest()
>>> '63f6fe797026d794e0dc3e2bd279aee19dd2f8db67488172a644bb68792a570c'

Also you can try library hashids. But note that it's not a hash algorithm and you (and anyone who knows salt) can decrypt data.

$ pip install hashids

Basic usage:

>>> from hashids import Hashids
>>> hashids = Hashids()
>>> hashids.encode(123)
'Mj3'
>>> hashids.decode('Mj3')
123
Roos answered 14/8, 2017 at 12:51 Comment(3)
Thanks – I'm aware of the hashlib module. What is inconvenient about it is that it only takes a flat bytes sequence as input. In my scenario this means traversing the config structure and converting each value to some bytes representation, which is cumbersome and error-prone.Albata
That's right, but you can't use hash for some types like: list, dict, set, so you need to convert this structures as for hashlib module. You can check it manually >>>hash({}).Roos
That's why I wrote in my post (first paragraph): “The config values are nested tuples with elements of hashable types like int, bool, str etc.” Note the word hashable.Albata
E
1

I need to create an identifier token from a set of nested configuration values

I came across this question while trying to solve the same problem, and realizing that some of the calls to hash return negative integers.

Here's how I would implement your token function:

import sys


def token(config) -> str:
    """Generates a hex token that identifies a hashable config."""
    # `sign_mask` is used to make `hash` return unsigned values
    sign_mask = (1 << sys.hash_info.width) - 1
    # Get the hash as a positive hex value with consistent padding without '0x'
    return f'{hash(config) & sign_mask:#0{sys.hash_info.width//4}x}'[2:]

In my case I needed it to work with a broad range of inputs for the config. It did not need to be particularly performant (it was not on a hot path), and it was acceptable if it occasionally had collisions (more than what would normally be expected from hash). All it really needed to do is produce short (e.g. 16 chars long) consistent outputs for consistent inputs. So for my case I used the above function with a small modification to ensure the provided config is hashable, at the cost of increased collision risk and processing time:

import sys


def token(config) -> str:
    """Generates a hex token that identifies a config."""
    # `sign_mask` is used to make `hash` return unsigned values
    sign_mask = (1 << sys.hash_info.width) - 1
    # Use `json.dumps` with `repr` to ensure the config is hashable
    json_config = json.dumps(config, default=repr)
    # Get the hash as a positive hex value with consistent padding without '0x'
    return f'{hash(json_config) & sign_mask:#0{sys.hash_info.width//4}x}'[2:]
Elliot answered 21/10, 2021 at 22:2 Comment(0)
F
-1

I'd reccomend using hashlib

cast the token to a string, and then cast the hexdigest to a hex integer. Bellow is an example with the sha256 algorithm but you can use any hashing algorithm hashlib supports

import hashlib as hl
def shasum(token):
    return int(hl.sha256(str(token).encode('utf-8')).hexdigest(), 16)
Fellini answered 21/10, 2021 at 22:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.