Chained Hash Tables vs. Open-Addressed Hash Tables

Can somebody explain the main differences (advantages / disadvantages) between the two implementations?

For a library, what implementation is recommended?

Distributive answered 31/3, 2010 at 20:13 Comment(0)

Wikipedia's article on hash tables gives a distinctly better explanation and overview of different hash table schemes that people have used than I'm able to off the top of my head. In fact you're probably better off reading that article than asking the question here. :)

That said...

A chained hash table indexes into an array of pointers to the heads of linked lists. Each linked list cell has the key for which it was allocated and the value which was inserted for that key. When you want to look up a particular element from its key, the key's hash is used to work out which linked list to follow, and then that particular list is traversed to find the element that you're after. If more than one key in the hash table has the same hash, then you'll have linked lists with more than one element.
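To make that concrete, here is a rough C sketch of a chained lookup. This is not any particular library's code; the names chain_node, chained_table and chained_lookup are invented purely for illustration:

    #include <string.h>

    /* One cell of a bucket's linked list. */
    struct chain_node {
        char              *key;
        void              *value;
        struct chain_node *next;
    };

    struct chained_table {
        struct chain_node **buckets;   /* array of list heads, one per bucket */
        size_t              nbuckets;
    };

    /* Hash the key, pick a bucket, then walk that bucket's list. */
    void *chained_lookup(const struct chained_table *t, const char *key,
                         size_t (*hash)(const char *))
    {
        size_t i = hash(key) % t->nbuckets;
        for (struct chain_node *n = t->buckets[i]; n != NULL; n = n->next) {
            if (strcmp(n->key, key) == 0)
                return n->value;       /* found the key in this chain */
        }
        return NULL;                   /* key not present */
    }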

The downside of chained hashing is having to follow pointers in order to search linked lists. The upside is that chained hash tables only get linearly slower as the load factor (the ratio of elements in the hash table to the length of the bucket array) increases, even if it rises above 1.

An open-addressing hash table stores its (key, value) entries directly in a single array of slots, with no per-bucket lists. You use the key's hash value to work out which slot in the array to look at first. If more than one key in the hash table has the same hash, then you use some scheme to decide on another slot to look in instead. For example, linear probing is where you look at the next slot after the one chosen, and then the next slot after that, and so on until you either find a slot that matches the key you're looking for, or you hit an empty slot (in which case the key must not be there).
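Here is a comparable C sketch of lookup with linear probing. Again, the names (oa_slot, oa_table, oa_lookup) are invented for illustration; entries live directly in the slot array and a NULL key marks an empty slot:

    #include <string.h>

    struct oa_slot {
        char *key;                     /* NULL means this slot is empty */
        void *value;
    };

    struct oa_table {
        struct oa_slot *slots;
        size_t          nslots;
    };

    /* Start at the hashed slot and walk forward until we find the key
     * or hit an empty slot (which means the key is absent). */
    void *oa_lookup(const struct oa_table *t, const char *key,
                    size_t (*hash)(const char *))
    {
        size_t start = hash(key) % t->nslots;
        for (size_t p = 0; p < t->nslots; p++) {
            struct oa_slot *s = &t->slots[(start + p) % t->nslots];
            if (s->key == NULL)
                return NULL;           /* empty slot: the key is not in the table */
            if (strcmp(s->key, key) == 0)
                return s->value;       /* found it */
        }
        return NULL;                   /* probed every slot without success */
    }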

Open-addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. It gets very, very slow if the load factor approaches 1, because you end up usually having to search through many of the slots in the bucket array before you find either the key that you were looking for or an empty slot. Also, you can never have more elements in the hash table than there are entries in the bucket array.

To deal with the fact that all hash tables at least get slower (and in some cases actually break completely) when their load factor approaches 1, practical hash table implementations make the bucket array larger (by allocating a new bucket array, and copying elements from the old one into the new one, then freeing the old one) when the load factor gets above a certain value (typically about 0.7).
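A minimal, self-contained sketch of that grow-and-rehash step for a linear-probing table (invented names; doubling is just one possible growth policy):

    #include <stdlib.h>

    struct oa_slot  { char *key; void *value; };              /* key == NULL: empty */
    struct oa_table { struct oa_slot *slots; size_t nslots; };

    /* Double the slot array and rehash every live entry into it. A real
     * implementation would call this when entries / nslots rises above
     * roughly 0.7. */
    int oa_grow(struct oa_table *t, size_t (*hash)(const char *))
    {
        size_t new_n = t->nslots * 2;
        struct oa_slot *new_slots = calloc(new_n, sizeof *new_slots);
        if (new_slots == NULL)
            return -1;                         /* out of memory; keep the old table */

        for (size_t i = 0; i < t->nslots; i++) {
            if (t->slots[i].key == NULL)
                continue;                      /* skip empty slots */
            size_t j = hash(t->slots[i].key) % new_n;
            while (new_slots[j].key != NULL)   /* linear probe for a free slot */
                j = (j + 1) % new_n;
            new_slots[j] = t->slots[i];        /* slot index usually changes: the modulus changed */
        }

        free(t->slots);                        /* free the old bucket array */
        t->slots  = new_slots;
        t->nslots = new_n;
        return 0;
    }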

There are lots of variations on all of the above. Again, please see the Wikipedia article; it really is quite good.

For a library that is meant to be used by other people, I would strongly recommend experimenting with existing implementations rather than writing your own. Since hash tables are generally quite performance-critical, you're usually best off using somebody else's implementation, which has already been carefully tuned. There are lots of open-source hash table implementations under BSD, LGPL and GPL licences.

If you're working with GTK, for example, then you'll find that there's a good hash table in GLib.
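For instance, a minimal GHashTable example might look like the following. This only shows the basic calls; compile against glib-2.0 (e.g. with pkg-config --cflags --libs glib-2.0):

    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hash table mapping string keys to string values. */
        GHashTable *table = g_hash_table_new(g_str_hash, g_str_equal);

        g_hash_table_insert(table, "apple",  "fruit");
        g_hash_table_insert(table, "carrot", "vegetable");

        const char *kind = g_hash_table_lookup(table, "apple");
        printf("apple is a %s\n", kind ? kind : "(not found)");

        g_hash_table_destroy(table);
        return 0;
    }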

Generator answered 1/4, 2010 at 0:38 Comment(8)
Excellent explanation. One thing I've recently learned that most summaries neglect to point out is that deletions adversely affect performance in open addressing tables. When you delete you only mark the entry as deleted. When inserting you can re-use a deleted entry, but when searching, you cannot stop on a deleted entry. If you do lots of insertions and deletions, then over time you accumulate deleted entries that count against the load factor. Thus performance degrades to O(n), even if the actual load remains low. If you don't delete, open addressing is great.Fader
@Adrian this is only true if you use that method of marking deletes. If instead you delete the item you are looking for, and then reinsert all the elements in your probing sequence after the deleted item, then deletion will be slower but will not necessarily affect insertion. However, if your implementation is prone to clustering, then deletion could be quite slow.Sirreverence
@AdrianMcCarthy - I didn't note that because I hadn't thought of it either at the time. Tuning the performance of hash table deletion seems to be kind of subtle, but not at all obviously so until you start thinking about the problem carefully!Generator
@AdrianMcCarthy: a very well-written comment, and one I am working on right now as I tune my own OAHT ( link to srcforge ). I am also studying Google's dense hash map. First, you don't need to mark entries as deleted unconditionally; you only need to mark them as deleted if they are part of a cluster. Secondly (a point I intend to investigate soon), you can run a garbage collection sporadically, e.g. rehash when the load factor gets high. I think that is the Google way.Pigmy
"The downside of chained hashing is having to follow pointers in order to search linked lists". Why is that a downside in relation to open addressing? In open addressing you also need to check if a slot is taken and otherwise go to the next one, thus traversing a sequence of slots. I think @AdrianMcCarthy's point on deletions is relevant here as a downside of open addressing, but I'm not convinced that "avoiding traversing a list / following pointers" is an upside in favor of open addressing.Cringe
@Josh: Following pointers in a linked list is cache hostile, which is a performance killer on modern CPUs. Open addressing clumps colliding keys near each other (at least the first few), so walking through keys that collide is much more cache friendly. If you have to check more than a few, they are generally in memory order and thus likely to be prefetched into cache.Fader
Thanks @AdrianMcCarthy, yes, that's right. I was hoping to shed light on caching as a potential benefit (not just that we avoid "having to follow pointers"). However, note that one could store linked lists in an array of contiguous memory, single-dimensional or multi-dimensional, and open-addressing algorithms like double hashing or quadratic probing don't guarantee contiguity in memory either. In other words, I'm not sure one could argue OA leads to better caching (if you know of any references I'm highly interested).Cringe
@Cringe With a linked list you have to dereference a pointer to find the next position. With open addressing you just compute the next slot's address (base + index * element size), so it's a memory fetch versus pure arithmetic. Then both approaches follow with the actual read of the next element.Lucio

My understanding (in simple terms) is that both methods have pros and cons, though most libraries use the chaining strategy.

Chaining Method:

Here each slot of the hash table's array points to a linked list of items. This is efficient if the number of collisions is fairly small. The worst-case scenario is O(n), where n is the number of elements in the table.

Open Addressing with Linear Probing:

Here, when a collision occurs, we move on to the next index until we find an open spot. So, if the number of collisions is low, this is very fast and space-efficient. The limitation is that the total number of entries in the table is limited by the size of the array; this is not the case with chaining.
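As a small C sketch of that limitation (invented names; a NULL key marks an empty slot), an insert with linear probing simply fails once every slot is occupied:

    #include <string.h>

    struct lp_slot  { const char *key; int value; };          /* key == NULL: empty */
    struct lp_table { struct lp_slot *slots; size_t nslots; };

    /* Insert with linear probing. Returns 0 on success, -1 if every slot is
     * already taken by a different key: unlike chaining, the table cannot hold
     * more entries than it has slots. */
    int lp_insert(struct lp_table *t, const char *key, int value,
                  size_t (*hash)(const char *))
    {
        size_t start = hash(key) % t->nslots;
        for (size_t p = 0; p < t->nslots; p++) {
            struct lp_slot *s = &t->slots[(start + p) % t->nslots];
            if (s->key == NULL || strcmp(s->key, key) == 0) {
                s->key   = key;                /* new entry, or overwrite existing value */
                s->value = value;
                return 0;
            }
        }
        return -1;                             /* table is completely full */
    }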

There is another approach: chaining with binary search trees. In this approach, when collisions occur, the colliding entries are stored in a binary search tree instead of a linked list, so the worst-case scenario is O(log n). In practice, this approach is best suited when the key distribution is extremely nonuniform.
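For a rough sense of the difference: if all n = 1,000,000 keys happened to land in a single bucket, a lookup in a linked-list chain could take up to 1,000,000 comparisons, while a balanced binary search tree would take on the order of log2(1,000,000) ≈ 20.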

Shoreward answered 6/6, 2017 at 6:8 Comment(0)

Since an excellent explanation has already been given, I'd just add visualizations taken from CLRS for further illustration:

Open Addressing: (figure from CLRS)

Chaining: (figure from CLRS)

Confederate answered 15/3, 2017 at 13:18 Comment(2)
I believe the first figure you put up is a diagram of direct-addressing instead of open-addressing.Puckery
Most of the time, the nodes in the linked list in the separate chaining method only point forward, not backwards.Aviator

Open addressing vs. separate chaining

Linear probing, double hashing and random hashing are appropriate if the keys are kept as entries in the hash table itself. Doing that is called "open addressing"; it is also called "closed hashing".

Another idea: entries in the hash table are just pointers to the head of a linked list (a "chain"); the elements of the linked list contain the keys. This is called "separate chaining"; it is also called "open hashing".

Collision resolution becomes easy with separate chaining: just insert a key into its linked list if it is not already there. (It is possible to use fancier data structures than linked lists for this, but linked lists work very well in the average case.)

Source: http://cseweb.ucsd.edu/~kube/cls/100/Lectures/lec16/lec16-25.html

Anthropophagi answered 26/4, 2019 at 0:45 Comment(0)

If the number of items that will be inserted in a hash table isn't known when the table is created, a chained hash table is preferable to open addressing.

Increasing the load factor (number of items / table size) causes major performance penalties in open-addressed hash tables, but performance degrades only linearly in chained hash tables.
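To put rough numbers on that (standard expected-cost results, e.g. from CLRS, assuming a reasonably uniform hash function): a chained lookup averages about 1 + α comparisons at load factor α, so even at α = 2 that is only about 3; an unsuccessful open-addressing lookup takes on average up to about 1/(1 − α) probes, so about 2 at α = 0.5, about 10 at α = 0.9, and about 100 at α = 0.99.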

If you are dealing with low memory and want to reduce memory usage, go for open addressing. If you are not worried about memory and want speed, go for chained hash tables.

When in doubt, use chained hash tables. Adding more data than you anticipated won’t cause performance to slow to a crawl.

Silvern answered 12/4, 2016 at 9:55 Comment(1)
Well, if the number of items is not known in advance, the standard approach would be to just grow the table in factors of two. That can be done with either open addressing or chained hash tables. No necessity for using a chained table, nor for driving it into a state where performance becomes linear.Erelia
