Is it possible to create a Minimal Perfect Hash function without a separate lookup table for a small (<64) set of keys?

Asked 24/4, 2019 at 7:3 Answered 25/4, 2019 at 7:9

I recently read this article Throw away the keys: Easy, Minimal Perfect Hashing about generating a minimal perfect hash table for a known set of keys.

The article seems to assume that you need an intermediate table. Is there any other, simpler way to generate such a function if we assume that the set of keys is small (i.e. < 64).

In my case, I want to map a set of thread ID:s to a unique block of data within an array. The threads are started before the hash function is generated and stay constant during the running time of the program. The exact number of threads vary but stays fixed during the runtime of the program:

unsigned int thread_ids*;
unsigned int thread_count;
struct {
    /* Some thread specific data */
}* ThreadData;

int start_threads () {
    /* Code which starts the threads and allocates the threaddata. */
}

int f(thread_id) {
    /* return unique index into threadData */
}

int main() {
    thread_count = 64; /* This number will be small, e.g. < 64 */
    start_threads();
    ThreadData[f(thread_ids[0])]
}

Treharne answered 24/4, 2019 at 7:3 Comment(12)

Interesting question. The thread ids are not known in advance and can in theory be arbitrary (in practice they are not, but we can't rely on that), so I very much doubt that there is a simpler way than a lookup table. – Appertain 24/4, 2019 at 7:12

I've edited the question to make it clear that the all the thread ID:s are known once the hash function is generated. – Treharne 24/4, 2019 at 7:59

If I understand task correctly you can just basic polynomial function and set some coeffiicients to zero to compress values wide range [1111,2222, 30000, 601234]. For python NumPy can calculate coefficients for you. – Iatry 24/4, 2019 at 8:28

That is indeed a solution, but in what way is it simpler than a lookup table? You could also generate a switch statement if the thread ids are known at compile time, but that is basically a lookup table in code. – Appertain 24/4, 2019 at 8:46

I want to generate the hash function during runtime. So a switch statement doesn't work. – Treharne 24/4, 2019 at 8:58

I've edited the question to make it clear that the exact number of thread ID:s isn't known until runtime but is guaranteed to be small and stay fixed during the lifetime of the program. – Treharne 24/4, 2019 at 9:5

No, but then you probably don't want to call a Python program at run time to find the right coefficients either? My original point was that the thread ids are not known in advance (compile time). As your sample code is C you need to compile the function f, so that is when it happens. – Appertain 24/4, 2019 at 9:5

My opinion is that a simple and efficient way would be to sort the thread ids in an array and use a binary search on that sorted array. – Pistole 24/4, 2019 at 9:31

Why do you think you need a perfect hash? A chained table is simpler, and walking the chain can be hidden in your f() function. (typical average chainlength is ~1.5 for a N=M table) – Tussis 24/4, 2019 at 11:49

Are we provided all <64 thread_ids beforehand or are they provided/updated dynamically? – Dariodariole 25/4, 2019 at 12:46

The threads are created at the start of the program and are fixed during its lifetime. – Treharne 25/4, 2019 at 12:47

Sorry, I still don't understand - what is the exact universe of input and expected output for each call to f? (For example: any 16 bit number, an integer between 0 and 63, etc.) Why can't we provide f the index of the thread in thread_ids rather than the thread id itself? Please forgive my ignorance. – Dariodariole 7/5, 2019 at 10:6

Yes, you can build a minimal perfect hash function (MPHF) at runtime. There are multiple algorithms you can use, but most of them are a bit complex to implement so I can't give you working sample code. Many are implemented in the cmph project.

The most simple one is probably BDZ. On a high level, lookup requires calculating 3 hash functions, and 3 memory accesses. If memory isn't an issue, you only need 2. It supports millions of keys. This algorithm requires a lookup table that is about 1.23 times the number of entries, when using 3 hash functions, and with 2 bits per entry.

There are other algorithms, one I invented myself, the RecSplit algorithm (there's even a research paper now), and there is a C++ implementation, and Java right now. Basically, the algorithms finds a way to split the set into subsets (recursively), until the subset size is 1. You need to remember how you split. The most simple solution is in fact using a lookup table for "how you split", but the table is really small, possibly only 5 integers for 64 keys. The first one to divide into 4 subsets of 16, and 4 to map each subset to a number 0..15.

(I added a second answer if you don't strictly need a minimal perfect hash function, just a perfect hash function. Construction is simpler and lookup is a lot faster, but requires a larger array.)

Madewell answered 24/4, 2019 at 12:17 Comment(0)

You could build a perfect hash as follows, using a brute-force search. For 64 entries, the size of the target array needs to be at least 512 entries, otherwise search won't find an index within reasonable time.

The perfect hash function is then murmur(x + perfectHashIndex) & (TARGET_SIZE - 1)

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint64_t murmur64(uint64_t h) {
    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    h *= UINT64_C(0xc4ceb9fe1a85ec53);
    h ^= h >> 33;
    return h;
}

// must be a power of 2
#define TARGET_SIZE 512

static uint64_t findPerfectHashIndex(uint64_t *array, int size) {
    uint64_t used[TARGET_SIZE / 64];
    for (uint64_t index = 0; index < 1000;) {
        memset(used, 0, TARGET_SIZE / 64 * sizeof(uint64_t));
        for (size_t i = 0; i < size; i++) {
            uint64_t x = murmur64(array[i] + index) & (TARGET_SIZE - 1);
            if (((used[x >> 6] >> (x & 63)) & 1) != 0) {
                goto outer;
            }
            used[x >> 6] |= 1UL << (x & 63);
        }
        return index;
        outer:
        index++;
    }
    // not found
    return -1;
}

int main() {
    int size = 64;
    uint64_t ids[size];
    for(int i=0; i<size; i++) ids[i] = 10 * i;
    uint64_t perfectHashIndex = findPerfectHashIndex(ids, size);
    if (perfectHashIndex == -1) {
        printf("perfectHashIndex not found\n");
    } else {
        printf("perfectHashIndex = %lld\n", perfectHashIndex);
        for(int i=0; i<size; i++) {
            printf("  x[%d] = %lld, murmur(x + perfectHashIndex) & (TARGET_SIZE - 1) = %d\n", 
                i, ids[i], murmur64(ids[i] + perfectHashIndex) & (TARGET_SIZE - 1));
        }
    }
}

Madewell answered 25/4, 2019 at 7:9 Comment(2)

Is my understanding correct: the algorithm checks if murmur hashing gives collisions, if none then success, else transform the keys slightly and check again? I like this, thank you – Oust 16/2, 2023 at 7:44

Yes this is the idea. "transform the keys slightly" is adding "index". – Madewell 17/2, 2023 at 8:11

Recommended topics

Hot tags