Minimal Perfect Hash Function
Asked Answered
S

2

16

I have 10^8 random integers in the range [0; 2^63-1]. There are no duplicates. The full list is known at compile-time, but it is just unique random numbers. These numbers never change.

To store one integer explicitly, 8 bytes are required, and each integer has an associated 1-byte value, so explicit storage requires about 860 MB.

So, I want to find a minimal perfect hash function to map each of the 10^8 integers from [0;2^63-1] to [0;10^8-1]. I should find this function only once, the data never changes, and the function may be complicated if necessary. But it should be minimal, perfect, and calculation should be fast. How can I do this?

Might it be possible to find and use some subsequences (if they occur)?

Thanks.

Swagger answered 19/7, 2011 at 6:55 Comment(1)
Full list known at compile-time? My advice then would be to 'manually' assign the numbers yourself, and then write a script to spit out a static declaration of a map in your desired programming language. If it never, ever changes, using a static data structure to perfectly map the values would be your ideal solution. I say 'manually' with inverted commas because you clearly aren't going to do it by hand. See other comments and answers for ideas of what tools can do the assigning for you.Kernan
D
12

Let your computer do the work for you:

http://www.gnu.org/software/gperf/

Quote: "GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only. "

Dorie answered 19/7, 2011 at 6:58 Comment(2)
but for this, CMPH would be better as it was conceived to create minimal perfect hash functions for very large sets of keys.Hargeisa
Thanks, probably I will try both.Swagger
T
5

I'm working on an algorithm and Java implementation that needs less than 1.6 bits per key.

Previously, I have implemented a minimal perfect hash function tool in Java that needs less than 2.0 bits per key.

Other algorithms are implemented in CMPH. For example CHD needs about 2.06 bits per key by default. It can be configured to use less space, but generation is then slower.

Update: There is now a paper about my invention called "RecSplit: Minimal Perfect Hashing via Recursive Splitting"

Turner answered 27/8, 2014 at 18:48 Comment(5)
I'm working on an improved algorithm that needs less than 1.58 bits per entry.Turner
Do you have anywrite up for your code. I was trying to implement it for Long data types but was getting indexoutofbounds errorGaptoothed
@Gaptoothed there is not much documentation currently; you could read the test cases. Maybe create a issue with a test case and exception, so I can look into what the problem could beTurner
I was looking at the LongCollection test case since i need i have long values as input. Once I have the buff ref, can you tell me where are you storing the hash values for the input and how can i read them?Gaptoothed
@Gaptoothed Hash values are not stored. I think it makes sense to ask a new question, and provide the code you have used and what you need. You can then add a comment for me, so I see the question.Turner

© 2022 - 2024 — McMap. All rights reserved.