Why do we use linear probing in hash tables when there is separate chaining with linked lists?

I recently learned about different methods of dealing with collisions in hash tables, and saw it claimed that separate chaining with linked lists is always more time efficient than linear probing. In terms of space, linear probing allocates a predefined block of memory that we might never fully use, while separate chaining allocates memory dynamically as needed.

Is separate chaining with linked lists really more efficient than linear probing? If so, why do we use linear probing at all?

Adrial answered 23/5, 2014 at 5:44 Comment(0)

I'm surprised that you found chained hashing to be faster than linear probing - in practice, linear probing is typically significantly faster than chaining. This is primarily due to locality of reference: the accesses performed during linear probing tend to be closer together in memory than the accesses performed during chained hashing.

There are other wins for linear probing as well. For example, insertions into a linear probing hash table don't require any new allocations (unless you're rehashing the table), so in applications like network routers where memory is scarce, it's nice to know that once the table is set up, elements can be placed into it with no risk of a malloc failing.
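
As a concrete sketch (hypothetical C, not from any particular router codebase): once the arrays below exist, no code path in insert can fail to allocate. Note also that consecutive probes touch adjacent memory, which is where the cache friendliness comes from.

#include <stdint.h>

#define CAPACITY 1024          /* fixed at table setup; never grows */
#define EMPTY    UINT32_MAX    /* sentinel; assumes it is never a real key */

static uint32_t keys[CAPACITY];    /* statically allocated: no malloc, ever */
static uint32_t values[CAPACITY];

static uint32_t hash(uint32_t key) { return key * 2654435761u; }

void table_init(void) {
    for (uint32_t i = 0; i < CAPACITY; i++) keys[i] = EMPTY;
}

/* Scan forward from the home slot until we find a free slot
 * (or the key itself, in which case we overwrite its value). */
int insert(uint32_t key, uint32_t value) {
    for (uint32_t n = 0; n < CAPACITY; n++) {
        uint32_t slot = (hash(key) + n) % CAPACITY;
        if (keys[slot] == EMPTY || keys[slot] == key) {
            keys[slot] = key;
            values[slot] = value;
            return 1;
        }
    }
    return 0;   /* table full -- but nothing could have failed to allocate */
}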

One weakness of linear probing is that, with a bad choice of hash function, primary clustering can cause the performance of the table to degrade significantly. Chained hashing can still suffer from a bad hash function, but it's less sensitive: elements with nearby (but distinct) hash codes land in different buckets and don't adversely impact the runtime, whereas in linear probing they pile up into long runs. Theoretically, linear probing only gives expected O(1) lookups if the hash function is 5-independent or if there's sufficient entropy in the keys. There are many ways to address this, such as using the Robin Hood hashing technique or hopscotch hashing, both of which have significantly better worst cases than vanilla linear probing.
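
For illustration, here's a minimal sketch of the Robin Hood insertion step (my own hypothetical C, not any particular library's): on each probe we compare how far the resident element sits from its home slot with how far the element we're carrying has traveled, and the "richer" element (the one closer to home) gives up its slot. This evens out probe lengths and tames clustering.

#include <stdint.h>

#define CAP   1024
#define EMPTY UINT32_MAX                /* sentinel; assumed never a real key */

typedef struct { uint32_t key, val; } Slot;
static Slot table[CAP];

static uint32_t hash(uint32_t k) { return k * 2654435761u; }

void rh_init(void) {
    for (uint32_t i = 0; i < CAP; i++) table[i].key = EMPTY;
}

/* How far slot i is from the home slot of key k, wrapping around. */
static uint32_t probe_dist(uint32_t k, uint32_t i) {
    return (i + CAP - hash(k) % CAP) % CAP;
}

/* Assumes the table never fills completely; duplicate keys not handled. */
void rh_insert(uint32_t key, uint32_t val) {
    uint32_t i = hash(key) % CAP;
    uint32_t dist = 0;                  /* distance traveled by carried element */
    for (;;) {
        if (table[i].key == EMPTY) {    /* empty slot: drop the element here */
            table[i].key = key;
            table[i].val = val;
            return;
        }
        uint32_t resident = probe_dist(table[i].key, i);
        if (resident < dist) {          /* resident is richer: take its slot, carry it on */
            Slot tmp = table[i];
            table[i].key = key;
            table[i].val = val;
            key = tmp.key;
            val = tmp.val;
            dist = resident;
        }
        i = (i + 1) % CAP;
        dist++;
    }
}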

The other weakness of linear probing is that its performance significantly degrades as the load factor approaches 1. You can address this either by rehashing periodically or by using the Robin Hood hashing technique described above.
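
A sketch of the periodic rehashing approach (again hypothetical C, with an arbitrary 0.7 load-factor threshold): when the table gets too full, allocate a table of twice the size and re-insert every live element into it.

#include <stdlib.h>
#include <stdint.h>

#define EMPTY UINT32_MAX   /* sentinel key marking an unused slot */

typedef struct {
    uint32_t *keys;
    size_t cap, count;
} Table;

static uint32_t hash(uint32_t k) { return k * 2654435761u; }

static void raw_insert(uint32_t *keys, size_t cap, uint32_t key) {
    size_t i = hash(key) % cap;
    while (keys[i] != EMPTY)          /* linear probe to the next free slot */
        i = (i + 1) % cap;
    keys[i] = key;
}

void maybe_rehash(Table *t) {
    if (t->count * 10 < t->cap * 7)   /* load factor still below 0.7 */
        return;
    size_t newcap = t->cap * 2;
    uint32_t *newkeys = malloc(newcap * sizeof *newkeys);
    if (!newkeys) return;             /* error handling elided in this sketch */
    for (size_t i = 0; i < newcap; i++) newkeys[i] = EMPTY;
    for (size_t i = 0; i < t->cap; i++)
        if (t->keys[i] != EMPTY)
            raw_insert(newkeys, newcap, t->keys[i]);
    free(t->keys);
    t->keys = newkeys;
    t->cap = newcap;
}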

Hope this helps!

Apricot answered 23/5, 2014 at 5:57 Comment(7)
I end up using separate chaining a lot, but in a way where the singly-linked list nodes just store an index into an array. The bucket heads are basically an array of 32-bit integers pointing to nodes, and the nodes are likewise just 32-bit integers storing the index of the next node. That avoids a memory allocation per node. I tend to find myself drawn to that solution since it's so predictable from a memory usage standpoint: if we have a hash table with 5,000 buckets and we insert 10,000 elements, the memory overhead is 60,000 bytes (4 bytes per bucket and 4 bytes per element)...His
... with the node next-indices being an array parallel to the array of elements. That also comes with the benefit that if the table is copied, it produces optimal spatial locality, since the copied table guarantees that neighbors in a bucket are contiguous. The only pesky thing is the 4-byte overhead per node/bucket, but I find it a worthy trade-off when we don't have to worry about clustering. Anyway, I wanted to jump in just because a list node doesn't always have to mean fragmented memory or a heap allocation per node.His
How do you efficiently handle deletions? It seems like you'd end up with a lot of unused slots in your elements table, plus some overhead in finding the next free position.Apricot
It gets a little gross in those cases, with low-level C code similar to a fixed allocator's: an individual element in the array serves as a union of the element itself and an index to the next unused slot in the array. The hash table stores a free_head index of its own, which models a stack without actually storing one (the memory doubles as either an index to the next free element or the data for an element). Freed slots are left vacant and reclaimed by subsequent insertions (a rough sketch of the idea appears after this comment thread).His
In my case the majority of use cases generally don't need to handle removal at all. It's often just: create a hash table, do a boatload of searches, then discard it.His
Simple version of a fixed-size chunk allocator (untested, just written quickly to illustrate the idea): paste.ofcode.org/uekw9VLamxtpCebvjrh82Z. That one is a bit simpler because it uses pointers instead of indices, but the concept is the same. And yes, it does leave holes behind, but ones that can be reclaimed very cheaply.His
And a graph to illustrate: imgur.com/a/XaKWM. I use this general strategy a lot for constant-time removals and insertions since a whole lot of my codebase revolves around storing indices to things that shouldn't be invalidated unless the element they're pointing to is specifically removed.His
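
A rough reconstruction of the scheme described in these comments (the names are mine, not the commenter's, and the linked paste uses pointers rather than indices): separate chaining where buckets and next-links are plain 32-bit index arrays parallel to the element array, with removals threading freed slots into a free list stored inside the next-link array itself.

#include <stdint.h>

#define NBUCKETS 5000
#define NELEMS   10000
#define NIL      UINT32_MAX   /* "no node" marker */

/* Overhead: 4 bytes per bucket + 4 bytes per element, so 5,000 buckets
 * and 10,000 elements cost exactly 60,000 bytes of overhead. */
static uint32_t head[NBUCKETS];     /* bucket -> first node index, or NIL */
static uint32_t next_idx[NELEMS];   /* node -> next node in chain, or NIL */
static uint32_t elem_key[NELEMS];   /* the elements themselves (keys only here) */

static uint32_t free_head = NIL;    /* stack of reclaimed slots, threaded
                                       through next_idx (the "union" trick) */
static uint32_t high_water = 0;     /* next never-used slot */

static uint32_t hash(uint32_t k) { return k % NBUCKETS; }

void table_init(void) {
    for (uint32_t b = 0; b < NBUCKETS; b++) head[b] = NIL;
}

int insert(uint32_t key) {
    uint32_t node;
    if (free_head != NIL) {             /* reuse a freed slot first */
        node = free_head;
        free_head = next_idx[node];
    } else if (high_water < NELEMS) {
        node = high_water++;
    } else {
        return 0;                       /* element pool exhausted */
    }
    uint32_t b = hash(key);
    elem_key[node] = key;
    next_idx[node] = head[b];           /* push onto the bucket's chain */
    head[b] = node;
    return 1;
}

int remove_key(uint32_t key) {
    uint32_t b = hash(key);
    uint32_t *link = &head[b];
    for (uint32_t n = *link; n != NIL; n = *link) {
        if (elem_key[n] == key) {
            *link = next_idx[n];        /* unlink from the chain */
            next_idx[n] = free_head;    /* slot becomes a free-list node */
            free_head = n;
            return 1;
        }
        link = &next_idx[n];
    }
    return 0;
}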

Linear probing is actually more memory efficient when the hash table is close to full.

Historically, one had very, very little memory, so every byte mattered (and there are still some cases where memory is very limited).

Why does it use less memory?

Consider what the tables look like: (separate chaining variations as per Wikipedia - there are other variations too, but they typically use more memory)

Linear             Separate chaining #1    Separate chaining #2
probing            List head in table      Pointer in table
|------|           |------|---|            |---|    |------|---|
|Object|           |Object|Ptr|            |Ptr| -> |Object|Ptr|
|------|           |------|---|            |---|    |------|---|
|Object|           |Object|Ptr|            |Ptr| -> |Object|Ptr|
|------|           |------|---|            |---|    |------|---|
| NULL |           | NULL |Ptr|            |Ptr|
|------|           |------|---|            |---|
 .                  .                       .
 .                  .                       .
 .                  .                       .

(Ptr stands for "pointer" - any pointer not pointing to something can be considered NULL)

Separate chaining #1 always uses more memory than linear probing, as every slot in the table is bigger by the size of a pointer.

Separate chaining #2 might have an advantage when there isn't much in the table, but when it gets full, it's going to have roughly an additional 2 pointers floating around for every element.
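
To make the per-slot costs concrete, here's a small C sketch (Object is a stand-in element type; actual sizes depend on pointer width and padding):

#include <stdio.h>

typedef struct { long payload; } Object;          /* stand-in element type */

typedef struct { Object obj; } LinearSlot;        /* linear probing: just the object */
typedef struct { Object obj; void *next; } Chain1Slot;   /* chaining #1: list head in table */
typedef struct Node { Object obj; struct Node *next; } Node;
typedef struct { Node *head; } Chain2Slot;        /* chaining #2: pointer in table, nodes outside */

int main(void) {
    /* On a typical 64-bit system this prints 8, 16, and 8 (+16 per node). */
    printf("linear probing slot: %zu bytes\n", sizeof(LinearSlot));
    printf("chaining #1 slot:    %zu bytes\n", sizeof(Chain1Slot));
    printf("chaining #2 slot:    %zu bytes (plus %zu per node)\n",
           sizeof(Chain2Slot), sizeof(Node));
    return 0;
}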


templatetypedef is probably right about linear probing typically being faster (he's rarely wrong), but separate chaining is typically taught as the faster option, and you see it in major APIs (the Java implementations, for example) - perhaps because of this belief, perhaps to avoid the cases where linear probing is much slower (with a few well-chosen values, you can quickly get O(n) performance out of linear probing while separate chaining would still have been O(1)), or perhaps for some other reason.

Teeming answered 23/5, 2014 at 8:14 Comment(0)
