A minimal hash function for C?
I can't use boost::hash because I have to stick with C and can't use C++.

But I need to hash a large number (10K to 100K) of token strings (5 to 40 bytes long) so that searching among them is as fast as possible.

MD5, SHA1, or any long hash function seems too heavy for a simple task; I am not doing cryptography. Plus there are the storage and computing costs.

Therefore my question:

  1. What might be the simplest hash algorithm that will avoid collisions in most practical cases?

  2. How many bits should I use for the hash value? I am developing for 32-bit systems. Do the hash algorithms in Perl/Python use 32-bit hashes too, or do I have to jump to 64?

  3. Regarding the implementation of hash tables in common scripting languages: do those implementations check for collisions, or can I avoid that part altogether?

Biting answered 13/4, 2009 at 13:54 Comment(3)
Have you considered using GLib? developer.gnome.org/glib/2.46/glib-Hash-Tables.html – Dupaix
The following page has several implementations of general purpose hash functions implemented in C (and many other languages): partow.net/programming/hashfunctions/index.html – Kandis
See also: hash function for string. I've posted the K&R and GCC C++11 hashes here, as some to test. – Inopportune

You can find a good (and fast) hash function, and an interesting read, at http://www.azillionmonkeys.com/qed/hash.html

The only time you should not check for collisions, is if you use a perfect hash -- a good old fashioned lookup table, like gperf.

Grith answered 13/4, 2009 at 14:4 Comment(2)
I would suggest looking at one that Hsieh's analysis missed: MurmurHash2. en.wikipedia.org/wiki/MurmurHash – Fallible
@StevenSudit, Murmur3 (current version of Murmur) is actually a bit slower than SuperFastHash by Paul Hsieh. However, SuperFastHash has many collisions on words and names. And it's not as fast as something like FarmHash. So I wouldn't recommend it. – Termless
  1. Here is a nice overview of the most notable known hash functions.

  2. 32 bits should work just fine.

  3. You always need to check for collisions, unless you want to write a funny hashtable :)

Cabrilla answered 13/4, 2009 at 14:2 Comment(2)
You don't need to check for collisions if you don't particularly care about which answer you get. The advantage is that you don't have to store the original key in the hash table, so you can save a lot of space. – Bourbon
Well, such non-deterministic behavior is what I meant by 'funny'. – Cabrilla

A general hash function for hash table lookup. It specifies "Do NOT use for cryptographic purposes," but since you stated that you have no such intent, you should be OK.

Also included is A Survey of Hash Functions to try out.

Hermy answered 13/4, 2009 at 14:0 Comment(0)

If you're on a POSIX-like system and sticking to plain C, I would simply use what the system already has to offer. man 3 hcreate gives you all the details, and an online version is at http://linux.die.net/man/3/hcreate

Megrim answered 13/4, 2009 at 16:5 Comment(0)

There are pretty good hashing functions that fit in a few lines of code, but they are not as fast as functions optimized for specific CPU families. Therefore, "minimal" low-code functions should only be used when hashing performance is not a primary concern, or when reducing space (RAM usage or code size) is more important than speed.

That being said, some of the best simple hashing functions are FNV1a, MurmurOAAT, RSHash and SDBM. The last two are the fastest of the four. SDBM is also easy to memorize, which probably makes it the best known of them.

SDBM is an example of a multiplicative hash function, a family to which DJB2 and K&Rv2 also belong. Such functions simply multiply the current hash value by an odd number and add the next byte: hash = hash * factor + data[i]. SDBM uses 65599 as the factor, DJB2 uses 33, and K&Rv2 uses 31.

RSHash and FNV1a are better than SDBM because they use much larger factors. MurmurOAAT is better still, because it also uses a right-shift operation (in addition to add-or-xor-a-byte and multiply). This produces much better avalanching.

Using 64-bit hashes makes sense when you actually use all 64 bits or, possibly, when you are developing for a 64-bit architecture. For hash tables, 64-bit hashes make little sense, because your table most likely won't have billions of slots. But if, for example, you use hashes as fingerprints (checksums), 64-bit hashes are much better than 32-bit ones.

Following is a comparison of the mentioned hash functions between themselves, as well as relative to some modern optimized functions, like Murmur3 and FarmHash. Some other hash functions, which are mentioned in other answers, are also included. Most of these functions are 32-bit. I also included some 64-bit hashing functions truncated to 32 bits for comparison.

(1) Collisions & Avalanching

                       |        Collisions           |  Avalanching
Hash function          | random 16B |  words + names | words + names
=====================================================================
FNV1a                  |     0.20%  |     45,  0.01% |        98.0
MurmurOAAT             |     0.20%  |     50,  0.01% |       100.0
RSHash                 |     0.20%  |     51,  0.01% |        95.5
SDBM                   |     0.20%  |     53,  0.01% |        93.1
Jenkins OAAT           |     0.19%  |     66,  0.01% |       100.0
K&R v2                 |     0.20%  |    767,  0.12% |        89.8
SuperFastHash          |     0.21%  |    808,  0.13% |       100.0
DJB2                   |     0.20%  |    856,  0.13% |        89.7
XOR_rotate             |     0.19%  |   2156,  0.34% |        89.5
DEKHash                |     0.19%  |   2312,  0.36% |        88.6
Fletcher32             |     0.19%  |   5274,  0.82% |        90.9
Adler32                |    67.44%  | 371448, 58.03% |        73.9
---------------------------------------------------------------------
MurmurHash3_x86_32     |     0.19%  |     42,  0.01% |       100.0
MurmurHash3_x64_32     |     0.20%  |     50,  0.01% |       100.0
farmhash_fingerprint32 |     0.20%  |     42,  0.01% |       100.0
farmhash_fingerprint64 |     0.19%  |     36,  0.01% |       100.0  <-- best
CRC32                  |     0.20%  |     45,  0.01% |        99.8

Adler32 is the only algorithm that performs poorly and produces a lot of collisions for 16-byte random buffers. It is a truly terrible hashing function!

Poor performance on words and names is, to me, an indication of a poor-quality hashing algorithm. That's why I would not recommend any algorithm with more than 0.01% collisions on this dataset.

(2) Speed

                       |                   Time, ms
Hash function          | random 16B | random 16MB | words + names 
==================================================================
FNV1a                  |     132.3  |      339.3  |         12.9
MurmurOAAT             |     170.3  |      500.7  |         11.9
RSHash                 |     122.2  |      250.5  |         10.9
SDBM                   |     116.1  |      250.8  |         11.3
Jenkins OAAT           |     159.0  |      417.8  |         11.2
K&R v2                 |     116.1  |      250.2  |         11.1
SuperFastHash          |      71.0  |      127.0  |          9.8
DJB2                   |     169.5  |      251.0  |         10.8
XOR_rotate             |     119.8  |      167.6  |         10.9
DEKHash                |     115.5  |      167.0  |         10.9
Fletcher32             |      79.0  |       52.2  |         18.5
Adler32                |     391.6  |      669.0  |         12.3
------------------------------------------------------------------
MurmurHash3_x86_32     |      93.0  |      124.3  |         10.9
MurmurHash3_x64_32     |      96.3  |       53.6  |         13.5
farmhash_fingerprint32 |      87.2  |       52.2  |          5.4
farmhash_fingerprint64 |      48.8  |       17.3  |          5.7  <-- fastest
CRC32 (bit-by-bit)     |    1189.6  |     2091.1  |         23.0
CRC32 (table-driven)   |     262.0  |      667.3  |         11.1

For timings, I compiled and ran my code on an Apple M1 Pro machine. You will likely get different speeds on Intel x86-64 architecture, especially for FarmHash, which is heavily optimized for Intel instruction sets. But the general picture will likely stay the same.

(3) Code

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define SMALL   16
#define LARGE   (1 << 24)   // size of 16 MiB
#define SEED    0x12345678

uint8_t buffer[LARGE * SMALL];
uint32_t hashes[LARGE];

typedef uint32_t (*hash_function) (const uint8_t*, size_t);

uint32_t DJB2_hash(const uint8_t* buf, size_t size) {
    uint32_t hash = 5381;
    for (size_t i = 0; i < size; i++)
        hash = ((hash << 5) + hash) + buf[i]; /* hash * 33 + byte */
    return hash;
}

uint32_t SDBM_hash(const uint8_t* buf, size_t size) {
    uint32_t hash = 0;
    for (size_t i = 0; i < size; i++)
        hash = (hash << 6) + (hash << 16) - hash + buf[i]; /* hash * 65599 + byte */
    return hash;
}

uint32_t FNV1a(const uint8_t* data, size_t size) {
    uint32_t h = 2166136261UL;
    for (size_t i = 0; i < size; i++) {
        h ^= data[i];
        h *= 16777619;
    }
    return h;
}

uint64_t FNV1a_64(const uint8_t* data, size_t size) {
    uint64_t h = 0xcbf29ce484222325UL;
    for (size_t i = 0; i < size; i++) {
        h ^= data[i];
        h *= 0x00000100000001B3UL;
    }
    return h;
}

uint32_t MurmurOAAT_32(const uint8_t* data, size_t size) {
    // One-byte-at-a-time hash based on Murmur's mix
    uint32_t h = SEED;
    for (size_t i = 0; i < size; i++) {
        h ^= data[i];
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint32_t KR_v2_hash(const uint8_t* data, size_t size) {
    // a.k.a. Java String hashCode(), a.k.a. BKDR Hash Function
    uint32_t hashval = 0;
    for (size_t i = 0; i < size; i++)
        hashval = 31*hashval + data[i];
    return hashval;
}

uint32_t Jenkins_one_at_a_time_hash(const uint8_t* data, size_t size) {
    // a.k.a. simply "One-at-a-Time Hash"
    uint32_t hash = 0;
    for (size_t i = 0; i < size; i++) {
        hash += data[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

#define POLY_CRC32C 0x82f63b78
#define POLY_CRC32  0xedb88320
uint32_t CRC32b(const uint8_t *data, size_t size) {
    unsigned int crc = 0xFFFFFFFF, mask;
    for (size_t i = 0; i < size; i++) {
        crc ^= data[i];
        for (int j = 7; j >= 0; j--) {
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (POLY_CRC32 & mask);
        }
    }
    return ~crc;
}

static inline uint32_t _rotl32(uint32_t x, int32_t bits) {
    return x<<bits | x>>(32-bits);      // C idiom: will be optimized to a single CPU rotate instruction
    // ("static inline" avoids a link error under C99 if the compiler chooses not to inline)
}

uint32_t XOR_rotate(const uint8_t* data, size_t size) { 
    // Source: https://mcmap.net/q/23366/-hash-function-for-string
    uint32_t result = 0x55555555;
    for (size_t i = 0; i < size; i++) {
        result ^= data[i];
        result = _rotl32(result, 5);
    }
    return result;
}

uint32_t Adler32_slow(const uint8_t* data, size_t len) {
    // This is a slow implementation of Adler32, but it doesn't matter because
    // it's a VERY BAD hash function anyway. Don't use it!
    const uint32_t MOD_ADLER = 65521;
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

uint32_t RSHash(const uint8_t* data, size_t size) {
    // Introduced by Robert Sedgewick in his "Algorithms in C" (1990) book.
    // Source: http://www.partow.net/programming/hashfunctions/index.html
    // RSHash seems to outperform JSHash, PJWHash (a.k.a. ELFHash) and APHash from the same source in all respects.
    uint32_t a = 63689, b = 378551;
    uint32_t hash = 0;
    for (size_t i = 0; i < size; i++) {
       hash = hash * a + data[i];
       a *= b;
    }
    return hash;
}

uint32_t DEKHash(const uint8_t* str, size_t length) {
    // Introduced by Donald E. Knuth in "The Art Of Computer Programming", Volume 3.
    // Source: http://www.partow.net/programming/hashfunctions/index.html
    unsigned int hash = length;
    for (size_t i = 0; i < length; i++)
       hash = ((hash << 5) ^ (hash >> 27)) ^ str[i];
    return hash;
}

uint32_t Fletcher32(const uint8_t *data_, size_t len) {
    // Note: when `len` (size in bytes) is odd, this function will read past the last byte.
    // However, for C-strings, the next byte will be zero, which is sufficient for this test.
    // In other cases, you have to do zero-padding.
    const uint16_t* data = (const uint16_t*) data_;
    uint32_t c0, c1;
    len = (len + 1) & ~1;      /* Round up len to words */

    /* We similarly solve for n > 0 and n * (n+1) / 2 * (2^16-1) < (2^32-1) here. */
    /* On modern computers, using a 64-bit c0/c1 could allow a group size of 23726746. */
    for (c0 = c1 = 0; len > 0; ) {
        size_t blocklen = len;
        if (blocklen > 360*2) {
            blocklen = 360*2;
        }
        len -= blocklen;
        do {
            c0 = c0 + *data++;
            c1 = c1 + c0;
        } while ((blocklen -= 2));
        c0 = c0 % 65535;
        c1 = c1 % 65535;
    }
    return (c1 << 16 | c0);
}

uint32_t Fletcher32_wrap(const uint8_t* data, size_t size) {
    if (size & 1) {
        // Add zero-padding for odd sizes. (This is, actually, unnecessary for
        // C-strings, because they are terminated by a nul-byte by definition.)
        uint8_t* buf = malloc(size + 1);
        memcpy(buf, data, size);
        buf[size] = 0;
        uint32_t res = Fletcher32(buf, size + 1);
        free(buf);
        return res;
    }
    return Fletcher32(data, size);
}

struct timeval stop, start;
void timing_start() {
    gettimeofday(&start, NULL);
}

void timing_end() {
    gettimeofday(&stop, NULL);
    printf("took %lu us, ", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}

#define ALGO_COUNT 12
hash_function get_hash_function(int algo) {
    hash_function res;
    switch (algo) {
        case 1:  printf("FNV1a                 "); res = FNV1a; break;
        case 2:  printf("MurmurOAAT            "); res = MurmurOAAT_32; break;
        case 3:  printf("RSHash                "); res = RSHash; break;
        case 4:  printf("SDBM                  "); res = SDBM_hash; break;
        case 5:  printf("Jenkins OAAT          "); res = Jenkins_one_at_a_time_hash; break;
        case 6:  printf("K&R v2                "); res = KR_v2_hash; break;
        case 7:  printf("DJB2                  "); res = DJB2_hash; break;
        case 8:  printf("XOR_rotate            "); res = XOR_rotate; break;
        case 9:  printf("DEKHash               "); res = DEKHash; break;
        case 10: printf("Fletcher32            "); res = Fletcher32_wrap; break;
        case 11: printf("Adler32               "); res = Adler32_slow; break;
        case 12: printf("CRC32 (bit-by-bit)    "); res = CRC32b; break;
        default: printf("ERROR: unknown algo \n"); exit(-1);
    }
    printf("\t");
    return res;
}

void fill_random(uint32_t* buffer, size_t count) {
    for (size_t i = 0; i < count; i++)
        buffer[i] = random();
}

void show_sequence() {
    printf("Generated random sequence:");
    for (int i = 0; i < SMALL; i++) {
        printf(" %02x", buffer[i]);
    }
    printf("\n");
}

void count_collisions(const uint32_t* hashes, size_t count) {
    size_t slots = count * 32;
    uint32_t* cnt_table = calloc(slots, sizeof(uint32_t));
    size_t collisions = 0;
    for (size_t i = 0; i < count; i++) {
        size_t j = hashes[i] % slots;
        while (cnt_table[j] != 0 && cnt_table[j] != hashes[i]) {
            j = (j + 1) % slots;
        }
        if (cnt_table[j] == 0) {
            cnt_table[j] = hashes[i];
        } else {
            collisions++;
        }
    }
    free(cnt_table);
    double share = 100.0*(double)collisions/count;
    printf("%lu collisions out of %lu, %0.2f\n", collisions, count, share);
}

void evaluate_avalanching(const uint32_t* hashes, size_t count) {
    // Looks at hashes of consecutive sorted strings and assesses their
    // similarity score. The lower the total similarity, the worse avalanching.
    uint32_t score = 0;
    char last[9], current[9];
    for (size_t i = 0; i < count; i++) {
        sprintf(current, "%08X", hashes[i]);
        if (i) {
            for (int j = 0; j < 8; j++)
                if (last[j] == current[j]) score++;
        }
        strcpy(last, current);
    }
    float similarity = 100.*score/count/8;
    float avalanching = similarity <= 6.25 ? 100 : (100 - similarity) / (100 - 6.25) * 100;
    printf("score: %u, similarity: %.2f, avalanching: %.1f\n", score, similarity, avalanching);
}

void calc_buffer_hashes(const char* type, const uint8_t* buffer, size_t length, size_t count) {
    printf("Calculating hashes of %s buffers...\n", type);
    for (int algo = 1; algo <= ALGO_COUNT; algo++) {
        hash_function func = get_hash_function(algo);
        timing_start();
        for (size_t i = 0; i < count; i++) {
            hashes[i] = func(buffer + length*i, length);
        }
        timing_end();
        count_collisions(hashes, count);
    }
}

void calc_word_hashes(const char* fname, char* buffer) {
#define MAX_WORD_LENGTH 50
    printf("Calculating hashes of English words... (please, provide name of the file with words as first argument)\n");

    char line[MAX_WORD_LENGTH + 1];
    size_t word_count = 0;

    // Open file (its name must be passed as the first command-line argument)
    if (fname == NULL) {
        fprintf(stderr, "ERROR: no word-list file given\n");
        exit(1);
    }
    FILE* f = fopen(fname, "r");
    if (f == NULL) {
        perror(fname);
        exit(1);
    }

    // Read file line by line
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';   // strip newline
        strcpy(buffer + word_count*MAX_WORD_LENGTH, line);
        word_count++;
    }
    fclose(f);

    // Calculate hashes
    for (int algo = 1; algo <= ALGO_COUNT; algo++) {
        hash_function func = get_hash_function(algo);
        timing_start();
        for (size_t i = 0; i < word_count; i++) {
            char* word = buffer + MAX_WORD_LENGTH*i;
            size_t len = strlen(word);
            hashes[i] = func((uint8_t*)word, len);
        }
        timing_end();
        count_collisions(hashes, word_count);
        evaluate_avalanching(hashes, word_count);
    }
}

int main(int argc, char* argv[]) {
    srandom(SEED);
    fill_random((uint32_t*) buffer, LARGE * SMALL / sizeof(uint32_t));
    show_sequence();
    calc_buffer_hashes("small", buffer, SMALL, LARGE);
    calc_buffer_hashes("large", buffer, LARGE, SMALL);
    calc_word_hashes(argv[1], (char*)buffer);
    return 0;
}

How to compile and run:

gcc -std=c99 -O2 minimal_hash.c -o minimal_hash
./minimal_hash words.txt

– where words.txt is a combination of all English words and human names (640143 unique sorted strings).

See full code, including the high performance functions, in this repo.

Termless answered 23/10, 2023 at 4:2 Comment(0)

xxhash is a quite fast and easy option. Simple code would use the XXH32 function:

unsigned int XXH32 (const void* input, int len, unsigned int seed);

It is a 32-bit hash. Since len is an int, for data larger than 2^31-1 bytes use these:

void*         XXH32_init   (unsigned int seed);
XXH_errorcode XXH32_update (void* state, const void* input, int len);
unsigned int  XXH32_digest (void* state);
Maccarone answered 22/10, 2013 at 8:48 Comment(0)

Try Adler32 for long strings or Murmur2 for short strings.

Gracia answered 13/4, 2009 at 14:12 Comment(2)
Adler32 is not a very good hash at all. In fact, it's even worse than CRC-32, as a hash. Murmur2, on the other hand, is a very fast hash with excellent distribution and worst-case behavior, so there's no reason to limit its use to short strings. I don't really understand the basis of your advice. – Fallible
CRC-32 is actually pretty good as a hash, just slow. But Adler32 is truly terrible! – Termless
