How does a hash table work?
Asked Answered

17

544

I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash (I am looking for an explanation of how), and then performs some kind of modulo to work out where the value lies in the array it is stored in, but that's where my knowledge stops.

Could anyone clarify the process?

Edit: I'm not asking specifically about how hash codes are calculated, but a general overview of how a hash table works.

Reversion answered 8/4, 2009 at 15:48 Comment(3)
Recently, I have written this (en.algoritmy.net/article/50101/Hash-table) article describing several ways to store and look up data, with an emphasis on hash tables and their strategies (separate chaining, linear probing, double hashing)Raisaraise
You could think of a hash table as an extended version of an array, that's not just limited to consecutive integer keys.Bruges
Here is another one: intelligentjava.wordpress.com/2016/10/19/…Mana
986

Here's an explanation in layman's terms.

Let's assume you want to fill up a library with books and not just stuff them in there, but you want to be able to easily find them again when you need them.

So, you decide that if the person that wants to read a book knows the title of the book and the exact title to boot, then that's all it should take. With the title, the person, with the aid of the librarian, should be able to find the book easily and quickly.

So, how can you do that? Well, obviously you can keep some kind of list of where you put each book, but then you have the same problem as searching the library, you need to search the list. Granted, the list would be smaller and easier to search, but still you don't want to search sequentially from one end of the library (or list) to the other.

You want something that, with the title of the book, can give you the right spot at once, so all you have to do is just stroll over to the right shelf, and pick up the book.

But how can that be done? Well, with a bit of forethought, and a lot of work when you fill up the library.

Instead of just starting to fill up the library from one end to the other, you devise a clever little method. You take the title of the book, run it through a small computer program, which spits out a shelf number and a slot number on that shelf. This is where you place the book.

The beauty of this program is that later on, when a person comes back in to read the book, you feed the title through the program once more, and get back the same shelf number and slot number that you were originally given, and this is where the book is located.

The program, as others have already mentioned, is called a hash algorithm or hash computation and usually works by taking the data fed into it (the title of the book in this case) and calculating a number from it.

For simplicity, let's say that it just converts each letter and symbol into a number and sums them all up. In reality, it's a lot more complicated than that, but let's leave it at that for now.

The beauty of such an algorithm is that if you feed the same input into it again and again, it will keep spitting out the same number each time.
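
For instance, a minimal sketch (in C++) of such a letter-summing toy hash might look like this; real hash functions mix the bits far more thoroughly, but the deterministic behaviour is the same:

#include <cstddef>
#include <string>

// Toy hash: just add up the character codes of the title.
// Deterministic: the same title always produces the same number.
std::size_t toy_hash(const std::string& title)
{
    std::size_t sum = 0;
    for (unsigned char c : title)
        sum += c;
    return sum;
}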

Ok, so that's basically how a hash table works.

Technical stuff follows.

First, there's the size of the number. Usually, the output of such a hash algorithm is inside a range of some large number, typically much larger than the space you have in your table. For instance, let's say that we have room for exactly one million books in the library. The output of the hash calculation could be in the range of 0 to one billion which is a lot higher.

So, what do we do? We use something called modulus calculation, which basically says that if you counted to the number you wanted (i.e. the one billion number) but wanted to stay inside a much smaller range, each time you hit the limit of that smaller range you started back at 0, but you have to keep track of how far in the big sequence you've come.

Say that the output of the hash algorithm is in the range of 0 to 20 and you get the value 17 from a particular title. If the size of the library is only 7 books, you count 1, 2, 3, 4, 5, 6, and when you get to 7, you start back at 0. Since we need to count 17 times, we have 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, and the final number is 3.

Of course modulus calculation isn't done like that, it's done with division and a remainder. The remainder of dividing 17 by 7 is 3 (7 goes 2 times into 17 at 14 and the difference between 17 and 14 is 3).

Thus, you put the book in slot number 3.
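
In code, all that counting boils down to the % (modulo) operator; for instance:

#include <cstddef>
#include <iostream>

int main()
{
    std::size_t shelf_slots = 7;                    // the library only holds 7 books
    std::size_t hash_value  = 17;                   // what the hash algorithm gave us for this title
    std::cout << hash_value % shelf_slots << '\n';  // prints 3 - the slot where the book goes
}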

This leads to the next problem. Collisions. Since the algorithm has no way to space out the books so that they fill the library exactly (or the hash table if you will), it will invariably end up calculating a number that has been used before. In the library sense, when you get to the shelf and the slot number you wish to put a book in, there's already a book there.

Various collision handling methods exist, including running the data through yet another calculation to get another spot in the table (double hashing), or simply finding a space close to the one you were given (i.e. right next to the previous book, assuming that slot was available; this is known as linear probing). This would mean that you have some digging to do when you try to find the book later, but it's still better than simply starting at one end of the library.
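
As a rough sketch of the linear probing idea (assuming an array of optional slots holding book titles, and relying on the standard library's string hash):

#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Linear probing: if the slot is taken, try the next one, wrapping around at the end.
// Assumes the table is never completely full, otherwise this loops forever.
void insert(std::vector<std::optional<std::string>>& slots, const std::string& title)
{
    std::size_t i = std::hash<std::string>{}(title) % slots.size();
    while (slots[i].has_value())          // someone already shelved a book here...
        i = (i + 1) % slots.size();       // ...so check the very next slot
    slots[i] = title;                     // found a free slot - put the book there
}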

Finally, at some point, you might want to put more books into the library than the library allows. In other words, you need to build a bigger library. Since the exact spot in the library was calculated using the exact and current size of the library, it follows that if you resize the library you might end up having to find new spots for all the books, since the calculation done to find their spots has changed.

I hope this explanation was a bit more down to earth than buckets and functions :)

Clamor answered 8/4, 2009 at 16:33 Comment(9)
Thanks for such a great explanation. Do you know where I can find more technical details regarding how it's implemented in 4.x .Net framework?Shandrashandrydan
No, it is just a number. You would just number each shelf and slot starting at 0 or 1 and increasing by 1 for each slot on that shelf, then continue numbering on the next shelf.Clamor
'Various collision handling methods exist, including running the data into yet another calculation to get another spot in the table' - what do you mean by another calculation? It is just another algorithm? OK, so suppose we use another algorithm that outputs a different number based on the book name. Then later on, if I were to find that book, how would I know which algorithm to use? I'd use first algorithm, second algorithm and so on until I find the book whose title is the one I'm looking for?Grisham
@user107986: "I'd use first algorithm, second algorithm and so on until I find the book whose title is the one I'm looking for?" - yes. Still, it's more common to search as a sequence of offsets from the original hashed-to position, such as each position thereafter until one's empty (indicating no match/a place you can insert a new book/value) (called linear probing), or at +1, +4, +9, +16 etc. (called quadratic probing)....Lahr
Do hash tables have the potential to leave a lot of empty wasted space where nothing is being stored, forcing the data to take up more room than it would otherwise?Volumeter
Muahaha, it's my secret plan to conquer the world, leave evil answers on Stack Overflow. Oops, not so secret anymore...Clamor
@KyleDelaney: No for closed hashing (where collisions are handled by finding an alternative bucket, which means the memory usage is fixed but you spend more time searching across buckets). For open hashing aka chaining in a pathological case (terrible hash function or inputs deliberately crafted to collide by some adversary/hacker) you could end up with most hash buckets empty, but the total memory usage is no worse - just more pointers NULL instead of indexing into the data usefully.Lahr
Don't the extra null pointers make the memory usage worse? Won't the hash table contain a lot of zeroes that it wouldn't have to contain otherwise?Volumeter
@KyleDelaney: need the "@Tony" thing to get notified of your comments. Seems you're wondering about chaining: say we have three value nodes A{ptrA, valueA}, B{ptrB, valueB}, C{ptrC, valueC}, and a hash table with three buckets [ptr1, ptr2, ptr3]. Regardless of whether there are collisions when inserting, the memory usage is fixed. You may have no collisions: A{NULL, valueA} B{NULL, valueB} C{NULL, valueC} and [&A, &B, &C], or all collisions A{&B, valueA} B{&C, valueB}, C{NULL, valueC} and [NULL, &A, NULL]: are the NULL buckets "wasted"? Kinda, kinda not. Same total memory used.Lahr
110

Usage and Lingo:

  1. Hash tables are used to quickly store and retrieve data (or records).
  2. Records are stored in buckets using hash keys
  3. Hash keys are calculated by applying a hashing algorithm to a chosen value (the key value) contained within the record. This chosen value must be a common value to all the records.
  4. Each bucket can have multiple records which are organized in a particular order.

Real World Example:

Hash & Co., founded in 1803 and lacking any computer technology, had a total of 300 filing cabinets to keep the detailed information (the records) for their approximately 30,000 clients. Each file folder was clearly identified with its client number, a unique number from 0 to 29,999.

The filing clerks of that time had to quickly fetch and store client records for the working staff. The staff had decided that it would be more efficient to use a hashing methodology to store and retrieve their records.

To file a client record, filing clerks would use the unique client number written on the folder. Using this client number (the hash key), they would take it modulo 300 in order to identify the filing cabinet it belongs in. When they opened the filing cabinet they would discover that it contained many folders ordered by client number. After identifying the correct location, they would simply slip it in.

To retrieve a client record, filing clerks would be given a client number on a slip of paper. Using this unique client number (the hash key), they would take it modulo 300 in order to determine which filing cabinet had the client's folder. When they opened the filing cabinet they would discover that it contained many folders ordered by client number. Searching through the records they would quickly find the client folder and retrieve it.

In our real-world example, our buckets are filing cabinets and our records are file folders.
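
In code, the clerks' filing rule is a one-liner; a hypothetical sketch:

int cabinet_for(int client_number)
{
    return client_number % 300;   // 300 filing cabinets, numbered 0 to 299
}
// e.g. client 12,345 goes to cabinet 45; inside the cabinet, folders stay sorted by client number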


An important thing to remember is that computers (and their algorithms) deal with numbers better than with strings. So accessing a large array using an index is significantly faster than searching sequentially.

As Simon has mentioned, and I believe this to be very important, the hashing part is there to transform a large space (of arbitrary length, usually strings, etc.) and map it to a small space (of known size, usually numbers) for indexing. This is very important to remember!

So in the example above, the 30,000 possible clients or so are mapped to a smaller space.


The main idea in this is to divide your entire data set into segments so as to speed up the actual searching, which is usually time consuming. In our example above, each of the 300 filing cabinets would (statistically) contain about 100 records. Searching (regardless of the order) through 100 records is much faster than having to deal with 30,000.

You may have noticed that some offices actually already do this. But instead of devising a hashing methodology to generate a hash key, they will in most cases simply use the first letter of the last name. So if you have 26 filing cabinets, each assigned a letter from A to Z, you in theory have just segmented your data and enhanced the filing and retrieval process.

Yonkers answered 8/4, 2009 at 17:20 Comment(6)
You describe a specific type of hash table collision avoidance strategy, called variably “open addressing” or “closed addressing” (yes, sad but true) or “chaining”. There’s another type which doesn’t use list buckets but instead stores the items “inline”.Heisel
excellent description. except each filing cabinet would contain, on average, about 100 records (30k records / 300 cabinets = 100). Might be worth an edit.Mcinnis
@TonyD, go to this site sha-1 online and generate a SHA-1 hash for TonyD that you type in the text field. You will end up with a generated value of something that looks like e5dc41578f88877b333c8b31634cf77e4911ed8c. This is nothing more than a large hexadecimal number of 160-bits (20-bytes). You can then use this to determine which bucket (a limited quantity) will be used to store your record.Yonkers
@TonyD, I'm not sure where the term "hash key" is referred in a conflicting matter? If so, please point out the two or more locations. Or are you saying that "we" use the term "hash key" while other sites such as Wikipedia uses "hash values, hash codes, hash sums, or simply hashes"? If so, who cares as long as the term used is consistent within a group or an organization. Programmers often use the "key" term. I would personally argue that another good option would be "hash value". But I would rule out using "hash code, hash sum or simply hashes". Focus on the algorithm and not the words!Yonkers
@TonyD, I've changed the text to "they would module the hash key by 300", hoping it will be cleaner and clearer for everyone. Thanks!Yonkers
@Yonkers this is also so helpful, I've been looking around the internet for an intuitive explanation of hashing and this list of replies to the original question is great!Comptom
67

This turns out to be a pretty deep area of theory, but the basic outline is simple.

Essentially, a hash function is just a function that takes things from one space (say strings of arbitrary length) and maps them to a space useful for indexing (unsigned integers, say).

If you only have a small space of things to hash, you might get away with just interpreting those things as integers, and you're done (e.g. 4-byte strings).

Usually, though, you've got a much larger space. If the space of things you allow as keys is bigger than the space of things you are using to index (your uint32's or whatever) then you can't possibly have a unique value for each one. When two or more things hash to the same result, you'll have to handle the redundancy in an appropriate way (this is usually referred to as a collision, and how you handle it or don't will depend a bit on what you are using the hash for).

This implies you want it to be unlikely to have the same result, and you probably also would really like the hash function to be fast.

Balancing these two properties (and a few others) has kept many people busy!

In practice you usually should be able to find a function that is known to work well for your application and use that.

Now to make this work as a hashtable: Imagine you didn't care about memory usage. Then you could create an array as long as your indexing set (all uint32's, for example). As you add something to the table, you hash its key and look at the array at that index. If there is nothing there, you put your value there. If there is already something there, you add this new entry to a list of things at that address, along with enough information (your original key, or something clever) to find which entry actually belongs to which key.

So as you go along, every entry in your hashtable (the array) is either empty, contains a single entry, or contains a list of entries. Retrieving is as simple as indexing into the array and either returning the value, or walking the list of values and returning the right one.

Of course, in practice you typically can't do this; it wastes too much memory. So you do everything based on a sparse array (where the only entries are the ones you actually use; everything else is implicitly null).

There are lots of schemes and tricks to make this work better, but that's the basics.
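
For concreteness, here is a bare-bones sketch of that layout: a fixed array of buckets where each bucket is a list of (key, value) pairs. It is illustrative only, not tuned for real use:

#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// A bare-bones separate-chaining table: an array of buckets, each bucket a list of (key, value) pairs.
struct ToyHashTable
{
    ToyHashTable() : buckets(16) {}                       // a fixed, modest number of buckets

    std::size_t index_of(const std::string& key) const
    {
        return std::hash<std::string>{}(key) % buckets.size();
    }

    void put(const std::string& key, const std::string& value)
    {
        for (auto& entry : buckets[index_of(key)])
            if (entry.first == key) { entry.second = value; return; }   // key already present: overwrite
        buckets[index_of(key)].push_back({key, value});                 // otherwise append to this bucket's list
    }

    std::optional<std::string> get(const std::string& key) const
    {
        for (const auto& entry : buckets[index_of(key)])
            if (entry.first == key) return entry.second;                // walk the (short) list in this bucket
        return std::nullopt;                                            // not in the table
    }

    std::vector<std::list<std::pair<std::string, std::string>>> buckets;
};

A real implementation would also count the stored entries and grow the bucket array (rehashing everything) once it gets too full, as other answers describe.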

Winy answered 8/4, 2009 at 16:11 Comment(3)
Sorry, I know this is an old question/answer, but I've been trying to understand this last point you make. A hash table has O(1) time complexity. However, once you use a sparse array, aren't you left needing to do a binary search to find your value? At that point doesn't the time complexity become O(log n)?Accompaniment
@herbrandson: no... a sparse array simply means relatively few indices have been populated with values - you can still index directly to the specific array element for the hash value you've calculated from your key; still, the sparse array implementation Simon describes is only sane in very limited circumstances: when bucket sizes are of the order of memory page sizes (vs. say int keys at 1-in-1000 sparseness and 4k pages = most pages touched), and when the OS treats all-0 pages efficiently (so all-unused-bucket pages don't need backing memory), when address space is plentiful....Lahr
@TonyDelroy - that's true, it is an oversimplification, but the idea was to give an overview of what they are and why, not a practical implementation. The details of the latter are more nuanced, as you nod to in your expansion.Winy
67

Lots of answers, but none of them are very visual, and hash tables can easily "click" when visualised.

Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people's names, after a few insertions it might be laid out in memory as below, where ()-enclosed numbers are hash values of the text/name.

bucket#  bucket content / linked list

[0]      --> "sue"(780) --> null
[1]      null
[2]      --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
[3]      --> "mary"(73) --> null
[4]      null
[5]      --> "masayuki"(75) --> "sarwar"(105) --> null
[6]      --> "margaret"(2626) --> null
[7]      null
[8]      --> "bob"(308) --> null
[9]      null

A few points:

  • each of the array entries (indices [0], [1]...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names)
  • each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets] e.g. 42 % 10 == [2]; % is the modulo operator - the remainder when divided by the number of buckets
  • multiple data values may collide at and be linked from the same bucket, most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values are the same (e.g. "fred" and "jane" both shown with hash 42 above)
    • most hash tables handle collisions - with slightly reduced performance but no functional confusion - by comparing the full value (here text) of a value being sought or inserted to each value already in the linked list at the hashed-to bucket

Linked list lengths relate to load factor, not the number of values

If the number of stored values grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, create new/updated linked lists there-from, delete the old array) to keep the ratio of values to buckets (aka load factor) somewhere in the 0.5 to 1.0 range.

Hans gives the actual formula for other load factors in a comment below, but for indicative values: with load factor 1 and a cryptographic strength hash function, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~.3% have five etc.. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
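
If you want to reproduce those percentages yourself, here is a quick sketch, assuming the idealised "effectively random" hash and load factor 1, so that bucket occupancy follows a Poisson distribution with mean 1:

#include <cmath>
#include <cstdio>

int main()
{
    double factorial = 1.0;
    for (int k = 0; k <= 5; ++k)
    {
        if (k > 0) factorial *= k;
        // P(a bucket holds exactly k elements) = e^-1 / k!  at load factor 1
        std::printf("%d elements: %4.1f%% of buckets\n", k, 100.0 * std::exp(-1.0) / factorial);
    }
    // average chain length seen when searching non-empty buckets: 1 / (1 - e^-1)
    std::printf("average non-empty chain length: %.2f\n", 1.0 / (1.0 - std::exp(-1.0)));
}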

How a hash table can associate keys with values

Given a hash table implementation as described above, we can imagine creating a value type such as `struct Value { string name; int age; };`, and equality comparison and hash functions that only look at the `name` field (ignoring age), and then something wonderful happens: we can store `Value` records like `{"sue", 63}` in the table, then later search for "sue" without knowing her age, find the stored value and recover or even update her age - happy birthday Sue - which interestingly doesn't change the hash value so doesn't require that we move Sue's record to another bucket.

When we do this, we're using the hash table as an associative container aka map, and the values it stores can be deemed to consist of a key (the name) and one or more other fields still termed - confusingly - the value (in my example, just the age). A hash table implementation used as a map is known as a hash map.

This contrasts with the example earlier in this answer where we stored discrete values like "sue", which you could think of as being its own key: that kind of usage is known as a hash set.
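
To make the key/value idea concrete, here is a sketch using std::unordered_set (the C++ standard library's hash table) with a hash and an equality comparison that only look at the name; the Value type follows the paragraph above and the details are purely illustrative:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Value { std::string name; int age; };

// Hash and equality deliberately look only at the name, exactly as described above.
struct NameHash  { std::size_t operator()(const Value& v) const { return std::hash<std::string>{}(v.name); } };
struct NameEqual { bool operator()(const Value& a, const Value& b) const { return a.name == b.name; } };

int main()
{
    std::unordered_set<Value, NameHash, NameEqual> people;
    people.insert({"sue", 63});
    auto it = people.find({"sue", 0});                    // the age passed here is ignored by NameHash/NameEqual
    int sues_age = (it != people.end()) ? it->age : -1;   // recovers 63 without knowing it up front
    (void)sues_age;
    // note: std::unordered_set elements are const, so to update the age in place you would
    // normally reach for std::unordered_map<std::string, int> (name -> age) instead
}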

There are other ways to implement a hash table

Not all hash tables use linked lists (known as separate chaining), but most general purpose ones do, as the main alternative closed hashing (aka open addressing) - particularly with erase operations supported - has less stable performance properties with collision-prone keys/hash functions.


A few words on hash functions

Strong hashing...

A general purpose, worst-case collision-minimising hash function's job is to spray the keys around the hash table buckets effectively at random, while always generating the same hash value for the same key. Even one bit changing anywhere in the key would ideally - randomly - flip about half the bits in the resultant hash value.

This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (code below), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit (in the binary digit sense) changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.

// note caveats above: cache unfriendly (SLOW) but strong hashing...
std::size_t random[8][256] = { ...random data... };
auto p = (const std::byte*)&my_double;
size_t hash = random[0][p[0]] ^
              random[1][p[1]] ^
              ... ^
              random[7][p[7]];

Weak but oft-fast hashing...

Many libraries' hashing functions pass integers through unchanged (known as a trivial or identity hash function); it's the other extreme from the strong hashing described above. An identity hash is extremely collision prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing (perhaps with some gaps), they'll map into successive buckets leaving fewer empty than random hashing leaves (our ~36.8% at load factor 1 mentioned earlier), thereby having fewer collisions and fewer longer linked lists of colliding elements than is achieved by random mappings. It's also great to save the time it takes to generate a strong hash, and if keys are looked up in order they'll be found in buckets nearby in memory, improving cache hits. When the keys don't increment nicely, the hope is they'll be random enough they won't need a strong hash function to totally randomise their placement into buckets.

Lahr answered 1/6, 2015 at 6:59 Comment(20)
Allow me to just say: fantastic answer.Bonhomie
@Tony Delroy Thanks for the amazing answer. I still have one open point in my mind though. You say that even if there are 100 million buckets,lookup time would be O(1) with load factor 1 and a cryptographic strength hash function. But what about finding the right bucket in 100 million? Even if we have all the buckets sorted,isn't it O(log100.000.000)? How can finding the bucket be O(1) ?Bankhead
@selman: your question doesn't provide many details to explain why you think it might be O(log100,000,000), but you do say "even if we have all the buckets sorted" - keep in mind that values in hash table buckets are never "sorted" in the usual sense: which value appears in which bucket is determined by applying the hash function to the key. Thinking the complexity is O(log100,000,000) implies you imagine doing a binary search through sorted buckets, but that's not how hashing works. Maybe read a few of the other answers and see if it starts to make more sense.Lahr
@TonyDelroy Indeed,"sorted buckets" are the best-case scenario that I imagine. Hence O(log100,000,000) . But if this is not the case,how can the application find related bucket among millions? Does hash function generate a memory location somehow?Bankhead
@selman: yes... the hash function's job is to decide which specific bucket (your "memory location") a given key belongs to, and then only the elements associated with that bucket need to be searched. In most cases, the number of such elements can be proven to be - on average - a small constant, hence O(1) amortised insertion, lookup and erasure. The hash function generally does that by producing a number that can be modulo-ed or AND-ed to form an index into the array of buckets. If there's a collision, various secondary storage/search techniques are used (in answer: linked lists).Lahr
@TonyDelroy Ok, but.... You're talking about "forming an index into the array of buckets". So if this array has 100.00.000 buckets,even with an index,reaching out to the specific bucket has a cost right? How can lookup cost could be O(1) in an array of 100.00.000 buckets?Bankhead
@selman: because computer memory allows constant time "random access": if you can calculate a memory address you can retrieve the memory contents without having to have accessed memory in other parts of the array. So, whether you access the first bucket, the last bucket, or a bucket anywhere in between, it will have the same performance characteristics (loosely, take the same amount of time, albeit subject to CPU L1/L2/L3 memory caching impacts but they only work to help you quickly re-access recently accessed or coincidentally nearby buckets, and can be ignored for big-O analysis).Lahr
The average, as well as the most likely, number of elements in a bucket should be the load factor. Why do you say the average is 2 for load factor 1?Salt
Just want to add that with a load factor of a and n buckets, each bucket has a binomial probability {{an}\choose k}1/n^k(1-1/n)^{an-k} of having k elements. The probability approaches the Poisson distribution of e^{-a}a^k/k! as n approaches infinity as a and k are held fixed. This gives the numerical percentage values of average number of elements in each bucket.Salt
@Hans: when you search for elements that are in the table, you never hash to an empty bucket: rather you hash to a bucket that has at least one element chained therefrom, and possibly many more. The average length of this chain is more than the load factor as you're only considering non-empty buckets, and if memory serves (I wrote this answer a long time ago) is 2 at load factor 1, regardless of the number of buckets; the important thing is that this allows the O(1) efficiency, rather than decaying as N increases. Thanks for mentioning the actual formula as a function of load factor.Lahr
The reasoning is incorrect. The average automatically takes 0 length into account, as 0 times anything is 0 and is in effect uncounted. By definition, the sum of all bucket lengths is the total number of elements an, a constant over all possibililties. Divide that by the number of buckets n gives the average bucket length a, the load factor. Maybe you are thinking about the total average run time including computing the key which takes O(1) time and searching a bucket with time proportional to the bucket length. That adds up to O(1+a).Salt
Let's go back a bit. "The average, as well as the most likely, number of elements in a bucket should be the load factor" - "equally most" is more accurate: ~=36.8% of buckets will have a chain length matching the load factor, as many buckets will be empty. Still, this just looks at averages across the entire table. Much more importantly for consistent performance, the individual collision chains in the table still have the same length probabilities and average length as table size increases.Lahr
If say larger tables (with the same load factor) tended to have more empty buckets, and more of the collision chains were longer, then the constant time properties would be compromised. This aspect is what my comments/explanation are about, not the whole-table average which is not enough to guarantee O(1) performance if collision chains tended to get longer with table size.Lahr
Tony, you should put @Salt before your comment directed to me, otherwise I will not be alerted. I only see your comment as I want to add one of my own. I do not understand what your phrase in quotation mark "equally most" means. The probability of length approaches the Poisson distribution I gave as the table length increases to infinity. Do you agree that the average bucket length equals to the load factor? It is a mathematical fact. If that is not what you intend to state, it would be better to edit that statement in your answer.Salt
I originally came to add the following comment. As I have stated before, the average bucket length equals to the load factor by definition. This quantity does not depend on thus does not reflect the specific distribution of the hashing function, which strives to approach the uniform distribution. The standard deviation reflects the dispersion of the distribution. It is the square root of the load factor. In this special case of the load factor being 1, 92% of the buckets have length 2 or less, 98% have length 3 or less. The best way to see this is to plot the cumulative Poisson distribution.Salt
@Hans: re "equally most" - I'm just saying that having the "number of elements in a bucket should be the load factor" is equally as likely as having an empty bucket. Anyway, I reworded what I wanted to communicate about chain lengths in the answer - and corrected the value which I'd misremembered (~1.58 not 2) - hopefully it's unambiguous now. Thanks also for the extra details about stddev / Poisson distributions - I imagine someone will find that useful. CheersLahr
You can only say the phrase "the number of elements in a bucket closest to the load factor" since the load factor a is a fraction while the bucket length is an integer. It is not true that the number of elements in a bucket closest to the load factor a is equally likely as having an empty bucket, as you can tell from the Poisson distribution with the former being e^{-a}a and the latter e^{-a}. They are only equal when a=1. I agree that the average bucket length conditioned on nonempty buckets is a better measure of efficiency. It is a/(1-e^{-a}) and is approximately 1.58 for a=1.Salt
@Hans: I've been explicitly discussing load factor 1 for the statistics in this answer, so seems we're on the same page.Lahr
@TonyDelroy I thought keys are also stored along with the values, hash and next pointer. Please correct if I am wrong.Bussard
@PrashantShubham: a hash table needn't have distinct keys and values - see (here). The hash needn't be stored: it can be regenerated from the key, and once searching a bucket byte-by-byte key comparisons are often used, but some hash tables do store the hash so they can a) quickly differentiate between hash value collisions and bucket collisions (fred & jane vs bill in my answer), and b) not have to invoke the hash function again on every key when resizing the hash table. The disadvantage is the extra memory needed to store the hashes.Lahr
24

You guys are very close to explaining this fully, but missing a couple things. The hashtable is just an array. The array itself will contain something in each slot. At a minimum you will store the hashvalue or the value itself in this slot. In addition to this you could also store a linked/chained list of values that have collided on this slot, or you could use the open addressing method. You can also store a pointer or pointers to other data you want to retrieve out of this slot.

It's important to note that the hashvalue itself generally does not indicate the slot into which to place the value. For example, a hashvalue might be a negative integer value. Obviously a negative number cannot point to an array location. Additionally, hash values will often be much larger than the number of slots available. Thus another calculation needs to be performed by the hashtable itself to figure out which slot the value should go into. This is done with a modulus math operation like:

uint slotIndex = hashValue % hashTableSize; // if hashValue can be negative, mask off its sign bit or take its absolute value first

This value is the slot the value will go into. In open addressing, if the slot is already filled with another hashvalue and/or other data, the modulus operation will be run once again to find the next slot:

slotIndex = (remainder + 1) % hashTableSize; // "remainder" is the slot index from the previous step; step to the next slot (linear probing)

I suppose there may be other more advanced methods for determining slot index, but this is the common one I've seen... would be interested in any others that perform better.

With the modulus method, if you have a table of, say, size 1000, any hashvalue between 0 and 999 maps straight to the matching slot. Negative values and values of 1000 or greater wrap around, so they can end up colliding with values already in those slots. The chances of that happening depend both on your hashing method and on how many total items you add to the hash table. Generally, it's best practice to make the size of the hashtable such that the total number of values added to it is only equal to about 70% of its size. If your hash function does a good job of even distribution, you will generally encounter very few to no bucket/slot collisions and it will perform very quickly for both lookup and write operations. If the total number of values to add is not known in advance, make a good guesstimate using whatever means, and then resize your hashtable once the number of elements added to it reaches 70% of capacity.

I hope this has helped.

PS - In C# the GetHashCode() method is pretty slow and results in actual value collisions under a lot of conditions I've tested. For some real fun, build your own hash function and try to get it to NEVER collide on the specific data you are hashing, run faster than GetHashCode, and have a fairly even distribution. I've done this using long instead of int size hashcode values and it's worked quite well on up to 32 million hash values in the hashtable with 0 collisions. Unfortunately I can't share the code as it belongs to my employer... but I can reveal it is possible for certain data domains. When you can achieve this, the hashtable is VERY fast. :)

Moulton answered 15/5, 2010 at 1:41 Comment(5)
i know the post is pretty old but can someone explain what (remainder + 1) means hereCoupon
@Coupon remainder refers to the result of the original modulo calculation, and we add 1 to it in order to find the next available slot.Ankylosis
"The array itself will contain something in each slot. At a minimum you will store the hashvalue or the value itself in this slot." - it's common for "slots" (buckets) to store no value at all; open addressing implementations often store either NULL or a pointer to the first node in a linked list - with no value directly in the slot/bucket. "would be interested in any others" - the "+1" you illustrate is called linear probing, oft-better-performing: quadratic probing. "generally encounter very few to no bucket/slot collisions" - @ 70% capacity, ~12% slots w/ 2 values, ~3% 3....Lahr
"I've done this using long instead of int size hashcode values and it's worked quite well on up to 32 million entires hashvalues in the hashtable with 0 collisions." - this simply isn't possible in the general case where the values of keys are effectively random in a much larger range than the number of buckets. Note that having distinct hash values is often easy enough (and your talk of long hash values implies that's what you've achieved), but ensuring they don't collide in the hash table after the mod/% operation is not (in the general case).Lahr
(Avoiding all collisions is known as perfect hashing. In general it's practical for a few hundred or thousand keys that are known in advance - gperf is an example of a tool to calculate such a hash function. You can also write your own in very limited circumstances - e.g. if your keys are pointers to objects from your own memory pool which is kept fairly full, with each pointer a fixed distance apart, you can divide the pointers by that distance and effectively have an index into a slightly-sparse array, avoiding collisions.)Lahr
20

This is how it works in my understanding:

Here's an example: picture the entire table as a series of buckets. Suppose you have an implementation with alpha-numeric hash-codes and have one bucket for each letter of the alphabet. This implementation puts each item whose hash code begins with a particular letter in the corresponding bucket.

Let's say you have 200 objects, but only 15 of them have hash codes that begin with the letter 'B.' The hash table would only need to look up and search through the 15 objects in the 'B' bucket, rather than all 200 objects.

As far as calculating the hash code, there is nothing magical about it. The goal is just to have different objects return different codes and for equal objects to return equal codes. You could write a class that always returns the same integer as a hash-code for all instances, but you would essentially destroy the usefulness of a hash-table, as it would just become one giant bucket.

Harsh answered 8/4, 2009 at 16:2 Comment(0)
13

Short and sweet:

A hash table wraps up an array; let's call it internalArray. Items are inserted into the array in this way:

let insert key value =
    internalArray[hash(key) % internalArray.Length] <- (key, value)
    //oversimplified for educational purposes

Sometimes two keys will hash to the same index in the array, and you want to keep both values. I like to store both values in the same index, which is simple to code by making internalArray an array of linked lists:

let insert key value =
    internalArray[hash(key) % internalArray.Length].AddLast(key, value)

So, if I wanted to retrieve an item out of my hash table, I could write:

let get key =
    let linkedList = internalArray[hash(key) % internalArray.Length]
    for (testKey, value) in linkedList
        if (testKey = key) then return value
    return null

Delete operations are just as simple to write. As you can tell, inserts, lookups, and removals from our array of linked lists are nearly O(1).

When our internalArray gets too full, maybe at around 85% capacity, we can resize the internal array and move all of the items from the old array into the new array.

Richardricharda answered 8/4, 2009 at 17:24 Comment(0)
12

It's even simpler than that.

A hashtable is nothing more than an array (usually a sparse one) of vectors which contain key/value pairs. The maximum size of this array is typically smaller than the number of items in the set of possible values for the type of data being stored in the hashtable.

The hash algorithm is used to generate an index into that array based on the values of the item that will be stored in the array.

This is where storing vectors of key/value pairs in the array comes in. Because the set of values that can be indexes into the array is typically smaller than the number of all possible values that the type can have, it is possible that your hash algorithm is going to generate the same value for two separate keys. A good hash algorithm will prevent this as much as possible (which is why it is usually delegated to the type in question, because it has specific information which a general hash algorithm can't possibly know), but it's impossible to prevent entirely.

Because of this, you can have multiple keys that will generate the same hash code. When that happens, the items in the vector are iterated through, and a direct comparison is done between the key in the vector and the key that is being looked up. If it is found, great, and the value associated with the key is returned; otherwise, nothing is returned.

Hoodoo answered 8/4, 2009 at 16:4 Comment(0)
11

You take a bunch of things, and an array.

For each thing, you make up an index for it, called a hash. The important thing about the hash is that it 'scatters' a lot; you don't want two similar things to have similar hashes.

You put your things into the array at position indicated by the hash. More than one thing can wind up at a given hash, so you store the things in arrays or something else appropriate, which we generally call a bucket.

When you're looking things up in the hash, you go through the same steps, figuring out the hash value, then seeing what's in the bucket at that location and checking whether it's what you're looking for.

When your hashing is working well and your array is big enough, there will only be a few things at most at any particular index in the array, so you won't have to look at very much.

For bonus points, make it so that when your hash table is accessed, it moves the thing found (if any) to the beginning of the bucket, so next time it's the first thing checked.

Sexy answered 8/4, 2009 at 16:22 Comment(1)
thanks for the last point which everyone else has missed mentioningFormative
6

The Basic Idea

Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

At a high level, a hash table is a way of storing things that is (kinda sorta ish) like a dresser for clothing. The basic idea is the following:

  • You get some number of locations (drawers) where items can be stored.
  • You come up with some rule that tells you which location (drawer) each item belongs in.
  • When you need to find something, you use that rule to determine which drawer to look into.

The advantage of a system like this is that, assuming your rule isn't too complicated and you have an appropriate number of drawers, you can find what you're looking for pretty quickly by simply looking in the right place.

When you're putting your clothes away, the "rule" you use might be something like "socks go in the top left drawer, and shirts go in the large middle drawer, etc." When you're storing more abstract data, though, we use something called a hash function to do this for us.

A reasonable way to think about a hash function is as a black box. You put data in one side, and a number called the hash code comes out of the other. Schematically, it looks something like this:

              +---------+
            |\|   hash  |/| --> hash code
   data --> |/| function|\|
              +---------+

All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

How exactly do you build a hash function? For now, let's go with "decent people shouldn't think too much about that." Mathematicians have worked out better and worse ways to design hash functions, but for our purposes we don't really need to worry too much about the internals. It's plenty good to just think of a hash function as a function that's

  • deterministic (equal inputs give equal outputs), but
  • looks random (it's hard to predict one hash code given another).

Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

To look something up in a hash table, we use basically the same procedure. We begin by computing the hash code for the item to look up, which tells us which bucket (drawer) to look in. If the item is in the table, it has to be in that bucket. Then, we just look at all the items in the bucket and see if our item is in there.

What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.

The Details

Explaining how "a hash table" works is a bit tricky because there are many flavors of hash tables. This next section talks about a few general implementation details common to all hash tables, plus some specifics of how different styles of hash tables work.

A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

(You sometimes see bitwise operators used here. If your table has a size that's a power of two, say 2^k, then computing the bitwise AND of the hash code and the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)
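
A quick sketch of that trick, assuming the table size really is a power of two:

#include <cstddef>

// For a table size m that is a power of two (m = 2^k), hash & (m - 1) keeps just the low k bits,
// which gives the same result as hash % m without the cost of a division.
std::size_t bucket_index(std::size_t hash, std::size_t m)
{
    return hash & (m - 1);   // e.g. m = 16: hash & 0xF  ==  hash % 16
}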

The next question is how to pick the right number of buckets. If you pick too many buckets, then most buckets will be empty or have few elements (good for speed - you only have to check a few items per bucket), but you'll use a bunch of space simply storing the buckets (not so great, though maybe you can afford it). The flip side of this holds true as well - if you have too few buckets, then you'll have more elements per bucket on average, making lookups take longer, but you'll use less memory.

A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
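
As a sketch of what rehashing looks like for a chained table (doubling the bucket count; the 1.0 maximum load factor is just an example):

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

using Buckets = std::vector<std::list<std::string>>;   // a chained table holding plain string keys

// Rebuild the table with twice as many buckets, redistributing every element (the "rehash").
Buckets rehash(const Buckets& old_buckets)
{
    Buckets bigger(old_buckets.size() * 2);
    for (const auto& chain : old_buckets)
        for (const auto& key : chain)
            bigger[std::hash<std::string>{}(key) % bigger.size()].push_back(key);
    return bigger;
}

// A caller would do something like this after each insertion, where `count` is the
// number of stored elements and 1.0 is the chosen maximum load factor:
//
//     if (count > buckets.size() * 1.0)
//         buckets = rehash(buckets);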

Hashing Strategies

Up to this point, we've been talking about chained hashing, which is one of many different strategies for building a hash table. As a reminder, chained hashing kinda sorta looks like a clothing dresser - each bucket (drawer) can hold multiple items, and when you do a lookup you check all of those items.

However, this isn't the only way to build a hash table. There's another family of hash tables that use a strategy called open addressing. The basic idea behind open addressing is to store an array of slots, where each slot can either be empty or hold exactly one item.

In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).

Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)

Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
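
Here is a compact sketch of that insertion loop with two tables of string keys; the second hash function below is only a stand-in, since a real implementation would use two genuinely independent hash functions:

#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct CuckooSet
{
    std::vector<std::optional<std::string>> t1, t2;
    explicit CuckooSet(std::size_t n) : t1(n), t2(n) {}

    std::size_t h1(const std::string& k) const { return std::hash<std::string>{}(k) % t1.size(); }
    std::size_t h2(const std::string& k) const { return (std::hash<std::string>{}(k) / 31) % t2.size(); }  // stand-in second hash

    bool contains(const std::string& k) const
    {
        return t1[h1(k)] == k || t2[h2(k)] == k;   // worst case: exactly two slots to check
    }

    // Returns false when the displacement limit is hit - the rare "cycle" case,
    // where the caller would rebuild the tables with new hash functions.
    bool insert(std::string k)
    {
        if (contains(k)) return true;
        for (int displacements = 0; displacements < 32; ++displacements)
        {
            auto& slot1 = t1[h1(k)];
            if (!slot1) { slot1 = std::move(k); return true; }
            std::swap(k, *slot1);                 // kick the previous occupant out of table 1
            auto& slot2 = t2[h2(k)];
            if (!slot2) { slot2 = std::move(k); return true; }
            std::swap(k, *slot2);                 // kick that one out of table 2, and keep going
        }
        return false;
    }
};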

There are many improvements possible for cuckoo hashing, such as using multiple tables, letting each slot hold multiple items, and making a "stash" that holds items that can't fit anywhere else, and this is an active area of research!

Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

Further Reading

For more information about the basics of hash tables, check out these lecture slides on chained hashing and these follow-up slides on linear probing and Robin Hood hashing. You can learn more about cuckoo hashing here and about theoretical properties of hash functions here.

Aurify answered 11/9, 2020 at 20:48 Comment(0)
4

All of the answers so far are good, and get at different aspects of how a hashtable works. Here is a simple example that might be helpful. Let's say we want to store some items with lowercase alphabetic strings as keys.

As simon explained, the hash function is used to map from a large space to a small space. A simple, naive implementation of a hash function for our example could take the first letter of the string, and map it to an integer, so "alligator" has a hash code of 0, "bee" has a hash code of 1, "zebra" would be 25, etc.

Next we have an array of 26 buckets (these could be ArrayLists in Java), and we put the item in the bucket that matches the hash code of our key. If we have more than one item whose key begins with the same letter, they will have the same hash code, so they would all go in the bucket for that hash code, and a linear search would have to be made within the bucket to find a particular item.
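
That naive first-letter hash could be sketched as (assuming non-empty, lowercase alphabetic keys, as stated above):

#include <string>

// The naive hash described above: "alligator" -> 0, "bee" -> 1, ..., "zebra" -> 25.
int first_letter_hash(const std::string& key)
{
    return key[0] - 'a';   // bucket index = position of the first letter in the alphabet
}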

In our example, if we just had a few dozen items with keys spanning the alphabet, it would work very well. However, if we had a million items, or all the keys started with 'a' or 'b', then our hash table would not be ideal. To get better performance, we would need a different hash function and/or more buckets.

Readus answered 8/4, 2009 at 16:41 Comment(0)
3

Here's another way to look at it.

I assume you understand the concept of an array A. That's something that supports the operation of indexing, where you can get to the Ith element, A[I], in one step, no matter how large A is.

So, for example, if you want to store information about a group of people who all happen to have different ages, a simple way would be to have an array that is large enough, and use each person's age as an index into the array. That way, you could have one-step access to any person's information.

But of course there could be more than one person with the same age, so what you put in the array at each entry is a list of all the people who have that age. So you can get to an individual person's information in one step plus a little bit of search in that list (called a "bucket"). It only slows down if there are so many people that the buckets get big. Then you need a larger array, and some other way to get more identifying information about the person, like the first few letters of their surname, instead of using age.

That's the basic idea. Instead of using age, any function of the person that produces a good spread of values can be used. That's the hash function. Like you could take every third bit of the ASCII representation of the person's name, scrambled in some order. All that matters is that you don't want too many people to hash to the same bucket, because the speed depends on the buckets remaining small.

Avar answered 8/4, 2009 at 17:44 Comment(0)
3

A hash table works entirely on the fact that practical computation follows the random access machine model, i.e. the value at any address in memory can be accessed in O(1), or constant, time.

So, if I have a universe of keys (the set of all possible keys that I can use in an application, e.g. roll numbers for students; if they are 4 digits, then this universe is the set of numbers from 1 to 9999), and a way to map them to a finite set of numbers of a size for which I can allocate memory in my system, theoretically my hash table is ready.

Generally, in applications the size of the universe of keys is very large compared to the number of elements I want to add to the hash table (I don't want to waste 1 GB of memory to hash, say, 10,000 or 100,000 integer values, because they are only 32 bits long in their binary representation). So, we use this hashing. It's sort of a mixing kind of "mathematical" operation which maps my large universe to a small set of values that I can accommodate in memory. In practical cases, the space of a hash table is often of the same "order" (big-O) as (number of elements * size of each element), so we don't waste much memory.

Now, a large set mapped to a small set means the mapping must be many-to-one. So, different keys will be allotted the same space (?? not fair). There are a few ways to handle this; I just know the two popular ones:

  • Use the space that was to be allocated to the value as a reference to a linked list. This linked list will store one or more values that come to reside in the same slot in the many-to-one mapping. The linked list also contains the keys to help someone who comes searching. It's like many people living in the same apartment: when a delivery man comes, he goes to the room and asks specifically for the guy.
  • Use a double hash function in an array which gives the same sequence of values every time rather than a single value. When I go to store a value, I see whether the required memory location is free or occupied. If it's free, I can store my value there; if it's occupied, I take the next value from the sequence, and so on until I find a free location, where I store my value. When searching for or retrieving the value, I go back over the same path given by the sequence and at each location check for the value, until I find it or have searched all possible locations in the array.
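
The probe sequence for the second option (double hashing) can be sketched as:

#include <cstddef>

// Double hashing: the i-th slot tried for a key starts at h1 and steps by h2 each time.
// h2 must not be 0 (and is usually chosen co-prime with table_size) so that every slot is reachable.
std::size_t probe(std::size_t h1, std::size_t h2, std::size_t i, std::size_t table_size)
{
    return (h1 + i * h2) % table_size;
}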

Introduction to Algorithms by CLRS provides a very good insight on the topic.

Resigned answered 12/6, 2015 at 5:19 Comment(0)
3

Direct address table

To understand a hash table, the direct address table is the first concept we should understand.

The direct address table uses the key directly as an index to a slot in an array. The size of the universe of keys is equal to the size of the array. It is really fast to access this key in O(1) time because an array supports random access operations.

However, there are four considerations before implementing a direct address table:

  1. To be a valid array index, the keys should be integers
  2. The universe of the keys should be fairly small; otherwise, we will need a giant array.
  3. No two different keys may be mapped to the same slot in the array
  4. The length of the universe of keys equals the length of the array

In fact, not a lot of situations in real life fit the above requirements, so a hash table comes to the rescue.

Hash table

Instead of using the key directly, a hash table first applies a mathematical hash function to consistently convert any arbitrary key data to a number, then uses that hash result as the key.

The length of the universe of keys can be larger than the length of the array, which means two different keys can be hashed to the same index (called a hash collision).

Actually, there are a few different strategies to deal with it. Here is a common solution: instead of storing the actual values in the array, we store a pointer to a linked list holding the values for all the keys that hash to that index.

If you are still interested in how to implement a hashmap from scratch, please read the following post.


Matron answered 10/4, 2021 at 7:41 Comment(0)
2

How the hash is computed usually does not depend on the hashtable, but on the items added to it. In frameworks/base class libraries such as .NET and Java, each object has a GetHashCode() (or similar) method returning a hash code for that object. The ideal hash code algorithm and the exact implementation depend on the data represented by the object.

Pegeen answered 8/4, 2009 at 15:52 Comment(0)
0

For all those looking for programming parlance, here is how it works. The internal implementation of advanced hashtables has many intricacies and optimisations for storage allocation/deallocation and search, but the top-level idea will be very much the same.

(void) addValue : (object) value
{
   int bucket = calculate_bucket_from_val(value);   //the hashing step: derive the bucket index from the value
   if (bucket_exists(bucket))
   {
       //do nothing, just overwrite below
   }
   else   //create bucket
   {
      create_extra_space_for_bucket();
   }
   put_value_into_bucket(bucket,value);
}

(bool) exists : (object) value
{
   int bucket = calculate_bucket_from_val(value);
   return bucket_contains(bucket, value);   //note: the bucket index itself (which may be 0) is not a boolean
}

where calculate_bucket_from_val() is the hashing function where all the uniqueness magic must happen.

The rule of thumb is: For a given value to be inserted, bucket must be UNIQUE & DERIVABLE FROM THE VALUE that it is supposed to STORE.

Bucket is any space where the values are stored - here I have kept it as an int, an array index, but it may be a memory location as well.

Hawfinch answered 7/10, 2015 at 11:11 Comment(1)
"rule of thumb is: For a given value to be inserted, bucket must be UNIQUE & DERIVABLE FROM THE VALUE that it is supposed to STORE." - this describes a perfect hash function, which is usually only possible for a few hundred or thousand values known at compile time. Most hash tables have to handle collisions. Also, hash tables tend to allocate space for all buckets whether they're empty or not, whereas your pseudo-code documents a create_extra_space_for_bucket() step during insertion of new keys. Buckets may be pointers though.Lahr
-1

A Hashtable internally contains buckets in which it stores the key/value pairs. It uses the hashcode to decide which bucket a key pair should map to. The function that obtains the bucket location from a key's hashcode is known as the hash function. In principle, a hash function is a function which, when given a key, generates an address in the table. A hash function always returns a number for an object. Two equal objects will always have the same number, while two unequal objects may not always have different numbers. When we put objects into a hashtable, it is possible that different objects have the same hashcode. This is known as a collision. To resolve collisions, a hashtable uses an array of lists. The pairs mapped to a single array index are stored in a list, and the list reference is stored at that index.

Paulapauldron answered 30/11, 2021 at 15:1 Comment(1)
Welcome to Stack Overflow. There are already 16 answers here, some of which are very detailed an highly upvoted. Does this answer improve upon what's already here? Please read How to Answer.Moulton
