Hash Table lookup - with perfect hash, in C

Asked 7/9, 2011 at 3:38 Answered 7/9, 2011 at 9:56

I have a C-language app where I need to do table lookups.

The entries are strings, All are known at the start of runtime. The table is initialized once, and then looked up many times. The table can change, but it's basically as if the app starts over. I think this means I can use a perfect-hash? It's ok to consume some time for the hashtable initialization, as it happens just once.

There will be between 3 and 100,000 entries, each one unique, and I estimate that 80% of cases will have fewer than 100 entries. A simple naive lookup is "fast enough" in those cases. (== no one is complaining)

However in the cases where there are 10k+ entries, the lookup speed of a naive approach is unacceptable. What's a good approach for delivering good hashtable-based lookup performance for strings in C? Assume I do not have a 3rd-party commercial library like Boost/etc. What hash algorithm should I use? how do I decide?

Polygenesis answered 7/9, 2011 at 3:38 Comment(2)

gnu.org/s/gperf ? – Wilderness 7/9, 2011 at 3:42

Also cmph.sourceforge.net – Obeng 7/9, 2011 at 3:44

Generating a perfect hash is not a simple problem. There's libraries devoted to the task. In this case the most popular one is probably CMPH. I haven't used it though so can't help beyond that. gperf is another tool, but it requires the strings to be known at compile time (you could work around it by compiling a .so and loading, but kind of overkill).

But frankly, I'd at least try to go with a binary search first. Simply sort the array using qsort, then search with bsearch (or roll your own). Both those are part of stdlib.h since C89.

Galvan answered 7/9, 2011 at 6:19 Comment(3)

They're also available in ANSI C ( C89 ). – Transformer 7/9, 2011 at 6:36

Right. Not sure why I looked at the Linux man page when I have a BSD one available. :) – Galvan 7/9, 2011 at 6:50

Good call, thanks Per. I was making the problem more complicated than it needed to be. – Polygenesis 15/9, 2011 at 14:2

If a naive (I assume you mean linear) approach is ok for 100 entries (so 50 comparisons are done on average) then a binary search will be more than sufficient for 100,000 entries (it takes at most 17 comparisons).

So I wouldn't bother with hashes at all but just resort to sorting your string table on startup (e.g. using qsort) and later using a binary search (e.g. using bsearch) to look up entries.

Illimani answered 7/9, 2011 at 6:30 Comment(0)

If the (maximal) table size is known, a plain hashtable with chaining is very easy to implement. Size overhead is only two ints per item. With a reasonable hash function only 1.5 probes per lookup are needed on average, this for a 100% loaded table.

Constructing a perfect hash is only feasible if your data does not change. Once it changes, you'll have to recompute and rehash, which is way more expensive than doing a few extra compares.

Ontogeny answered 7/9, 2011 at 9:56 Comment(0)

Recommended topics

Hot tags