Commutative hash function for uint32_t value pairs
Asked Answered
C

2

8

I need a fast, simple hash function that creates a unique identifier for a pair of uint32_t values - so the same hash value for (2,7) and (7,2).

Any idea?

Calliopsis answered 20/7, 2013 at 18:43 Comment(2)
Create a uint64 by bitshifting the smaller (or larger) of the two numbers and adding the other. Then you just need to hash the 64bit int. (Alternatively, have the hash function copy the pair, then swap the elements to guarantee order and apply the real hash on that pair)Fatherless
@DavidRodríguez-dribeas: Yeah, meanwhile I came up with the solution, thanks. I already had the bitshift stuff for it, but you remembered me that the comparison is the magic.Calliopsis
C
6

To answer my own question, the solution is:

uint64_t hash(uint32_t x, uint32_t y)
{
    const uint64_t a = static_cast<uint64_t>(x);
    const uint64_t b = static_cast<uint64_t>(y);

    if (x < y) return (b << 32) | a;
    else return (a << 32) | b;
}

Which can be improved to the branchless version

uint64_t hash(uint32_t x, uint32_t y)
{
    const uint64_t a = static_cast<uint64_t>(x);
    const uint64_t b = static_cast<uint64_t>(y);

    const uint64_t h0 = (b << 32) | a;
    const uint64_t h1 = (a << 32) | b;

    return (x < y) ? h0 : h1; // conditional move (CMOV) instruction
}

These methods are perfect hash functions - they guarantee zero collisions. However, they have the disadvantage that you cannot hash values above 2^32 - 1.

Calliopsis answered 20/7, 2013 at 19:29 Comment(5)
I like the idea of shifting, it's natural and there is no need to proof uniqueness. If you want to deal with values above 2^32 you can return a string as a unique identifier, where you reserve a special symbol to separate two parts of a hash (changing representation base to larger than 10 is a good idea also)Yukyukaghir
"changing representation base to larger than 10 is a good idea also" - how you mean that?Calliopsis
if you are representing number as a string, then you can use a larger alphabet than {0,1, ..., 9} e.g. hexadecimal system or even larger. The larger base the shorter representation.Yukyukaghir
I have used if (x>y) when I had the exact same need as yours, but now that I think about it, hash(x) ^ hash(y) would have done as well (with the issue that all pairs of identical elements (x, x) have the same hash. That can be a problem for some applications)Keary
@Calliopsis signed int right shifting is implementation-defined, and your max << 32 invokes undefined behaviour (because you are shifting by an amount >= datatype size). Only after the | the result is guaranteed to be cast to uint64_t. In practice your compiler is doing 64bit computation anyway, but this cannot be assumed for sure.Monteux
I
2
constexpr uint32_t hash_max = ...;    

constexpr uint32_t commutative_hash(uint32_t i, uint32_t j) {
   return (i*j + (i*i)*(j*j) + (i*i*i)*(j*j*j)) % hash_max;
};

Extra parentheses are for compiler - it will be easier to optimize this expression.

Do not use any conditional instruction (or std::max/std::min) which breaks CPU pipeline (and is slow) if you want to make a fast function.

Illusionary answered 20/7, 2013 at 18:57 Comment(6)
Thanks, but this is extremely slow due to the lot of multiplications. See my answer.Calliopsis
I can bet that my function is faster. Did you benchmark your function against mine? I've added explanation in the answer why it is faster.Illusionary
@LeonidVolnitsky +1 for the right explanation. The conditional statement might confuse branch prediction in some cases.Torietorii
@LeonidVolnitsky: Indeed you are right, yours is faster. I'm surprised that conditionals so much slow down the things. Could you explain it in more details?Calliopsis
On modern CPUs with pipeline - yes. See also sites.google.com/site/dataanxiety/blog/…Illusionary
I edited your function to correctly handle zero inputs. See my answer update.Calliopsis

© 2022 - 2024 — McMap. All rights reserved.