Commutative hash function for uint32_t value pairs

Asked 20/7, 2013 at 18:43 Answered 20/7, 2013 at 19:29

I need a fast, simple hash function that creates a unique identifier for a pair of uint32_t values - so the same hash value for (2,7) and (7,2).

Any idea?

Calliopsis answered 20/7, 2013 at 18:43 Comment(2)

Create a uint64 by bitshifting the smaller (or larger) of the two numbers and adding the other. Then you just need to hash the 64bit int. (Alternatively, have the hash function copy the pair, then swap the elements to guarantee order and apply the real hash on that pair) – Fatherless 20/7, 2013 at 18:49

@DavidRodríguez-dribeas: Yeah, meanwhile I came up with the solution, thanks. I already had the bitshift stuff for it, but you remembered me that the comparison is the magic. – Calliopsis 20/7, 2013 at 18:50

To answer my own question, the solution is:

uint64_t hash(uint32_t x, uint32_t y)
{
    const uint64_t a = static_cast<uint64_t>(x);
    const uint64_t b = static_cast<uint64_t>(y);

    if (x < y) return (b << 32) | a;
    else return (a << 32) | b;
}

Which can be improved to the branchless version

uint64_t hash(uint32_t x, uint32_t y)
{
    const uint64_t a = static_cast<uint64_t>(x);
    const uint64_t b = static_cast<uint64_t>(y);

    const uint64_t h0 = (b << 32) | a;
    const uint64_t h1 = (a << 32) | b;

    return (x < y) ? h0 : h1; // conditional move (CMOV) instruction
}

These methods are perfect hash functions - they guarantee zero collisions. However, they have the disadvantage that you cannot hash values above 2^32 - 1.

Calliopsis answered 20/7, 2013 at 19:29 Comment(5)

I like the idea of shifting, it's natural and there is no need to proof uniqueness. If you want to deal with values above 2^32 you can return a string as a unique identifier, where you reserve a special symbol to separate two parts of a hash (changing representation base to larger than 10 is a good idea also) – Yukyukaghir 20/7, 2013 at 22:51

"changing representation base to larger than 10 is a good idea also" - how you mean that? – Calliopsis 20/7, 2013 at 22:59

if you are representing number as a string, then you can use a larger alphabet than {0,1, ..., 9} e.g. hexadecimal system or even larger. The larger base the shorter representation. – Yukyukaghir 20/7, 2013 at 23:5

I have used if (x>y) when I had the exact same need as yours, but now that I think about it, hash(x) ^ hash(y) would have done as well (with the issue that all pairs of identical elements (x, x) have the same hash. That can be a problem for some applications) – Keary 28/7, 2013 at 13:52

@Calliopsis signed int right shifting is implementation-defined, and your max << 32 invokes undefined behaviour (because you are shifting by an amount >= datatype size). Only after the | the result is guaranteed to be cast to uint64_t. In practice your compiler is doing 64bit computation anyway, but this cannot be assumed for sure. – Monteux 6/3, 2017 at 10:18

constexpr uint32_t hash_max = ...;    

constexpr uint32_t commutative_hash(uint32_t i, uint32_t j) {
   return (i*j + (i*i)*(j*j) + (i*i*i)*(j*j*j)) % hash_max;
};

Extra parentheses are for compiler - it will be easier to optimize this expression.

Do not use any conditional instruction (or std::max/std::min) which breaks CPU pipeline (and is slow) if you want to make a fast function.

Illusionary answered 20/7, 2013 at 18:57 Comment(6)

Thanks, but this is extremely slow due to the lot of multiplications. See my answer. – Calliopsis 20/7, 2013 at 19:30

I can bet that my function is faster. Did you benchmark your function against mine? I've added explanation in the answer why it is faster. – Illusionary 20/7, 2013 at 19:39

@LeonidVolnitsky +1 for the right explanation. The conditional statement might confuse branch prediction in some cases. – Torietorii 20/7, 2013 at 19:44

@LeonidVolnitsky: Indeed you are right, yours is faster. I'm surprised that conditionals so much slow down the things. Could you explain it in more details? – Calliopsis 20/7, 2013 at 20:10

On modern CPUs with pipeline - yes. See also sites.google.com/site/dataanxiety/blog/… – Illusionary 20/7, 2013 at 20:22

I edited your function to correctly handle zero inputs. See my answer update. – Calliopsis 20/7, 2013 at 20:59

Recommended topics

Hot tags