perfect hash function
Asked Answered
M

7

21

I'm attempting to hash the values

10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0

I need a function that will map them to an array that has a size of 13 without causing any collisions.

I've spent several hours thinking this over and googling and can't figure this out. I haven't come close to a viable solution.

How would I go about finding a hash function of this sort? I've played with gperf, but I don't really understand it and I couldn't get the results I was looking for.

Mchugh answered 9/11, 2010 at 6:4 Comment(4)
This sounds like homework ... anyway, write a program to do it for you! :-) Come up with a generic formula, likely using pow or a bit-wise operation, and modulus (hey, there is already an example in an answer!), and then have the computer plunk through values until there is a "perfect hash function match found". I did this for my CS homework years ago and it worked great ;-)Heavyset
It sounds as though you're trying to find a minimal perfect hash function.Witling
On second thoughts... you've got 11 data points. Why do you want to map to an array of size 13? What's the significance of that number 13?Witling
I ran your numbers into 'gperf' and it produced a perfect hash function. Look at the output you got and you'll see a function called 'hash' in there.Strict
C
13

Found One

I tried a few things and found one semi-manually:

(n ^ 28) % 13

The semi-manual part was the following ruby script that I used to test candidate functions with a range of parameters:

t = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
(1..200).each do |i|
  t2 = t.map { |e| (e ^ i) % 13 }
  puts i if t2.uniq.length == t.length
end
Caravansary answered 9/11, 2010 at 6:17 Comment(0)
S
24

if you know the exact keys then it is trivial to produce a perfect hash function -

int hash (int n) {
  switch (n) {
    case 10:   return 0;
    case 100:  return 1;
    case 32:   return 2;
    // ...
    default:   return -1;
  }
}
Selenography answered 9/11, 2010 at 6:10 Comment(8)
Sorry, I edited it to be C++, then the C++ tag was removed from the question. You can revert if you want.Larochelle
meh, it was supposed to be C-like pseudo-code, it is better now, tySelenography
That is effectively a table search. One of the benefits of a hash function is supposed to be: it allows a faster evaluation of hit/miss than a table search.Witling
@Craig this is demonstrably faster than a table search, even with gcc -O0 this is 5 times faster than a simple linear table search on my machine, with -O2 the linear search takes over a second and time reports a total time of 0.00 for 1 million lookups... It is nearly identical to the accepted answer in speed over 100 million iterations with -O0 and within 0.2s with -O2. And this hash function is faster if all you need to do is work out if the key is present/valid - hash(n)==-1 requires no memory access... and you can safely add keys with the function remaining perfect.Selenography
Your hash might be faster than a data table look-up agred, but that's not what I meant. To clarify: Your hash might work well for this small data set, agreed. But for a data set of size n, your hash is O(n), whereas the accepted answer is O(1). The point of a hash function is to provide an O(1) solution. So in the particular case of the data set of the original question, your solution is satisfactory, but in the more general case of "finding a perfect hash function for data sets" (in particular, larger than some threshold), your answer isn't suitable.Witling
But a perfect hash function does not make sense for anything larger than this number of values, and it's not a question about hash functions in general. Yet this is an effective method for efficiently implementing perfect hash functions in the range where they are appropriate. Also, the compiler can optimize to a jump table or search tree. It's not necessarily O(n) but often O(1) or O(log(n))Selenography
"But a perfect hash function does not make sense for anything larger than this number of values"—I've just managed to make a [non-minimal] perfect hash function that hashes my fixed data set of 154 16-bit values into an 8-bit output, without collisions. Then if I want to make a minimal perfect hash function out of that, I can do it with an extra 256-entry look-up table.Witling
A good optimizing compiler will in the worst case produce an O(log(n)) lookup to find the correct switch case, which defeats the purpose of using the hash function for a hash table. In addition, you might want to derive the perfect hash function at runtime sometimes.Emulsoid
C
13

Found One

I tried a few things and found one semi-manually:

(n ^ 28) % 13

The semi-manual part was the following ruby script that I used to test candidate functions with a range of parameters:

t = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
(1..200).each do |i|
  t2 = t.map { |e| (e ^ i) % 13 }
  puts i if t2.uniq.length == t.length
end
Caravansary answered 9/11, 2010 at 6:17 Comment(0)
W
5

On some platforms (e.g. embedded), modulo operation is expensive, so % 13 is better avoided. But AND operation of low-order bits is cheap, and equivalent to modulo of a power-of-2.

I tried writing a simple program (in Python) to search for a perfect hash of your 11 data points, using simple forms such as ((x << a) ^ (x << b)) & 0xF (where & 0xF is equivalent to % 16, giving a result in the range 0..15, for example). I was able to find the following collision-free hash which gives an index in the range 0..15 (expressed as a C macro):

#define HASH(x)    ((((x) << 2) ^ ((x) >> 2)) & 0xF)

Here is the Python program I used:

data = [ 10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0 ]

def shift_right(value, shift_value):
    """Shift right that allows for negative values, which shift left
    (Python shift operator doesn't allow negative shift values)"""
    if shift_value == None:
        return 0
    if shift_value < 0:
        return value << (-shift_value)
    else:
        return value >> shift_value

def find_hash():
    def hashf(val, i, j = None, k = None):
        return (shift_right(val, i) ^ shift_right(val, j) ^ shift_right(val, k)) & 0xF

    for i in xrange(-7, 8):
        for j in xrange(i, 8):
            #for k in xrange(j, 8):
                #j = None
                k = None
                outputs = set()
                for val in data:
                    hash_val = hashf(val, i, j, k)
                    if hash_val >= 13:
                        pass
                        #break
                    if hash_val in outputs:
                        break
                    else:
                        outputs.add(hash_val)
                else:
                    print i, j, k, outputs

if __name__ == '__main__':
    find_hash()
Witling answered 8/8, 2011 at 0:16 Comment(0)
A
3

Bob Jenkins has a program for this too: http://burtleburtle.net/bob/hash/perfect.html

Unless you're very lucky, there's no "nice" perfect hash function for a given dataset. Perfect hashing algorithms usually use a simple hashing function on the keys (using enough bits so it's collision-free) then use a table to finish it off.

Accessory answered 9/11, 2010 at 6:17 Comment(0)
B
3

Just some quasi-analytical ramblings:

In your set of numbers, eleven in all, three are odd and eight are even. Looking at the simplest forms of hashing - %13 - will give you the following hash values: 10 - 3, 100 - 9, 32 - 6, 45 - 6, 58 - 6, 126 - 9, 3 - 3, 29 - 3, 200 - 5, 400 - 10, 0 - 0

Which, of course, is unusable due to the number of collisions. Something more elaborate is needed.

Why state the obvious? Considering that the numbers are so few any elaborate - or rather, "less simple" - algorithm will likely be slower than either the switch statement or (which I prefer) simply searching through an unsigned short/long vector of size eleven positions and using the index of the match.

Why use a vector search?

  1. You can fine-tune it by placing the most often occuring values towards the beginning of the vector.
  2. I assume the purpose is to plug in the hash index into a switch with nice, sequential numbering. In that light it seems wasteful to first use a switch to find the index and then plug it into another switch. Maybe you should consider not using hashing at all and go directly to the final switch?
  3. The switch version of hashing cannot be fine-tuned and, due to the widely differing values, will cause the compiler to generate a binary search tree which will result in a lot of comparisons and conditional/other jumps (especially costly) which take time (I've assumed you've turned to hashing for its speed) and require space.
  4. If you want to speed up the vector search additionally and are using an x86-system you can implement a vector search based on the assembler instructions repne scasw (short)/repne scasd (long) which will be much faster. After a setup time of a few instructions you will find the first entry in one instruction and the last in eleven followed by a few instructions cleanup. This means 5-10 instructions best case and 15-20 worst. This should beat the switch-based hashing in all but maybe one or two cases.
Boathouse answered 9/11, 2010 at 9:28 Comment(0)
N
0

I did a quick check and using the SHA256 hash function and then doing modular division by 13 worked when I tried it in Mathematica. For c++ this function should be in the openssl library. See this post.

If you were doing a lot of hashing and lookup though, modular division is a pretty expensive operation to do repeatedly. There is another way of mapping an n-bit hash function into a i-bit indices. See this post by Michael Mitzenmacher about how to do it with a bit shift operation in C. Hope that helps.

Nucleoplasm answered 9/11, 2010 at 6:17 Comment(0)
F
0

Try the following which maps your n values to unique indices between 0 and 12 (1369%(n+1))%13

Freshman answered 21/9, 2013 at 19:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.