Finding items in a universal hash table?

If items are organized randomly, how does the table know where to start looking?

In a non-random table, items are organized according to some characteristic (e.g., name). So if the table needs to look up some arbitrary information about "John", it can start looking in the 'J' bucket.

In a universal hash table though, items are arranged randomly. There's no defining characteristic. Therefore to find some arbitrary info about "John", wouldn't the table have to look through every bucket?

Isn't that a huge waste of time? It's like looking through every cabinet in your house to find a spoon.

Hiddenite answered 2/5, 2012 at 15:2 Comment(2)
Key to this is to understand the randomized part of a randomized data structure (or randomized algo) :)Balaton
A proper hash function should generate deterministic output (that is, for some input x, it should always generate the same hash h each time it's run on x). This allows hash tables to operate as they do, where keys are actually just hashes generated from inputs.Gustation

A hash looks more or less random, but it's deterministic -- that is, a particular input always produces the same hash value.

Based on that, when you want to insert an item in a hash table, you start by generating the hash for that input. You then use that to index into the table, and insert your item at that spot in the table. In a typical case, you have one part that's treated as the key, and you have some more information associated with that (e.g., you might be able to look up people by name, and with each name you have information about that person).

Later, when you want to look up (the information associated with) a particular key (in this case, the person), you hash the key again to find the right spot in the hash table to look for that information.

That does skip over a few crucial details, such as how you handle two or more inputs happening to produce the same hash value (which is unavoidable unless you place some limits on the allowable inputs). There are various ways to handle this, such as just looking through the table sequentially to find the next free spot, re-hashing to find another spot in the table, or building something like a linked list of items that hashed to the same value.
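
A rough sketch of that flow, with collisions handled by chaining; the names, the bucket count, and the use of Python's built-in hash() are arbitrary choices here:

```python
class ChainedHashTable:
    def __init__(self, num_buckets=16):
        # each bucket holds the (key, value) pairs that hashed to that slot
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # deterministic: the same key always lands in the same bucket
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        # only the one bucket the key hashes to is searched, not the whole table
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.insert("John", {"age": 30})
print(table.lookup("John"))   # {'age': 30}
```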

In any case, it should probably be added that there are use cases for which a hash table does end up a bit like you've surmised. Just for one example, when you want to see all the contents of a hash table (rather than just looking up one item at a time), you really do normally scan through the whole table. Even if your hash table is nearly empty, you normally have to scan from one end to the other, looking for every entry that's actually in use. When you do that, you get items in an order that appears fairly random.

This points to another shortcoming of hash tables -- you generally need a precise match with a single original record for them to work well. For example, let's consider some queries based on my last name. Assuming you indexed on the entire last name, it would be trivial to find "Coffin" -- but at least with most normal hash functions, searching for "something starting with 'Cof'" would be dramatically slower, as would "find all the names between 'Coffin' and 'Demming'".

As such, you're sort of half correct -- hash tables are typically very fast for a few specific cases (primarily searching for an exact match), but the approach you've outlined (scanning through the entire table) is close to the only choice available for other kinds of queries. So if you want to support anything but exact matches, a different data structure may be preferable.

That deals primarily with the most typical uses/types of hash tables. It's possible to create hash functions that at least bend (if not outright break) those rules to varying degrees. In most cases these involve some compromises though. For example, given geographic information as input, you could create a hash (of sorts) by simply truncating the coordinates (or one of them, anyway) to get the same information at lower precision. This organizes the information to at least a degree, so things that are close together end up with similar hash values, making it easier to find neighboring data. This, however, will often lead to more collisions (e.g., you'll get a lot of items hashing to the same value for the downtown of a large city).
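
A tiny illustrative sketch of that idea (the coordinates and precision are made up, and rounding stands in for truncation, but the effect is the same):

```python
def coarse_geo_hash(lat, lon, places=1):
    # reduce both coordinates to lower precision, so nearby points share a
    # bucket; a dense downtown will pile many items into one bucket (collisions)
    return (round(lat, places), round(lon, places))

# two points a few hundred metres apart end up in the same bucket
print(coarse_geo_hash(40.7128, -74.0060))   # (40.7, -74.0)
print(coarse_geo_hash(40.7141, -74.0041))   # (40.7, -74.0)
```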

Looking specifically at universal hashing, this adds one extra element to the puzzle: instead of a single hash function, you have a family of hash functions from which you choose "randomly". When universal hashing is used to implement a hash table (which isn't always -- it's also often used for things like message authentication codes), you typically do not choose the hash function randomly every time you insert an item. Rather, you typically choose a hash function and continue to use it until you encounter some fixed number of collisions. Then you randomly choose another hash function.
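
For concreteness, one classic universal family is h(x) = ((a*x + b) mod p) mod m, where a and b are drawn at random when the table is set up and then reused. A minimal sketch, assuming integer keys; the prime and bucket count here are arbitrary:

```python
import random

P = 2_147_483_647   # a large prime (2^31 - 1); keys are assumed smaller than this
M = 64              # number of buckets

def make_hash_function(m=M, p=P):
    # pick one function "at random" from the family, then reuse it
    a = random.randrange(1, p)   # a in [1, p-1]
    b = random.randrange(0, p)   # b in [0, p-1]
    return lambda x: ((a * x + b) % p) % m

h = make_hash_function()
print(h(12345), h(12345))   # same key -> same bucket, every time this function is used
```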

For example, in cuckoo hashing (a scheme that's commonly built on universal hash functions), you hash your key to find a location. If it's already occupied, you "kick out" the existing item there, and re-hash it to find an alternate location for it. It gets inserted there. If that slot is already occupied, it "kicks out" the item already in that slot, and the pattern repeats.

When you're searching for an item, you hash it and look at that location. If that's empty, you immediately know your item isn't present. If that slot's occupied, but doesn't contain your item, you re-hash to find the alternate location. Continue this pattern for as many hash functions as you're using (normally only two in the case of cuckoo hashing, but you could obviously use an otherwise similar algorithm with more functions).

It is possible for this to fail--to enter an infinite loop, or (almost equivalently) to build a chain that exceeds some pre-set length. In this case, you start over, re-building the table with a different pair of hash functions.
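
A simplified sketch of that insert/kick/lookup/rebuild cycle, assuming integer keys, two tables, an (a*x + b) mod p style hash family, and a made-up kick limit; re-inserting an existing key isn't handled, for brevity:

```python
import random

class CuckooTable:
    P = 2_147_483_647      # large prime for the (a*x + b) % p family
    MAX_KICKS = 32         # arbitrary bound before we give up and rebuild

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        self.funcs = [self._random_hash(), self._random_hash()]

    def _random_hash(self):
        a, b = random.randrange(1, self.P), random.randrange(self.P)
        return lambda x, a=a, b=b: ((a * x + b) % self.P) % self.size

    def lookup(self, key):
        # every key has exactly two possible homes, so lookup is at most two probes
        for table, h in zip(self.tables, self.funcs):
            slot = table[h(key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def insert(self, key, value):
        entry, i = (key, value), 0
        for _ in range(self.MAX_KICKS):
            pos = self.funcs[i](entry[0])
            if self.tables[i][pos] is None:
                self.tables[i][pos] = entry
                return
            # occupied: kick out the current occupant and try to re-home
            # it at its alternate location in the other table
            entry, self.tables[i][pos] = self.tables[i][pos], entry
            i = 1 - i
        # too many kicks (probably a cycle): draw fresh hash functions,
        # rebuild the whole table, then re-insert everything
        old = [e for t in self.tables for e in t if e is not None] + [entry]
        self.__init__(self.size)
        for k, v in old:
            self.insert(k, v)
```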

When using open addressing (of which cuckoo hashing is one form), deletion tends to be non-trivial. In particular, we have to ensure that when we remove an item at one location, it wasn't the beginning of a chain of items that collided at that location. In many cases, it's most efficient to simply add a third state for a slot: if it's never been occupied, it's just empty. If it's currently occupied, it's in use. If an item has been deleted there, it's deleted. This way when you're searching for an item, if you encounter a "deleted" slot, you continue searching for your item (whereas, if you get to a slot that's never been used, you can stop searching immediately -- your item clearly was never inserted).
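
A sketch of that three-state idea on top of plain linear probing (the sentinel objects and table size are arbitrary, and each key is assumed to be inserted at most once, just to keep it short):

```python
EMPTY, DELETED = object(), object()   # "never occupied" vs. "tombstone" markers

class ProbingTable:
    def __init__(self, size=16):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # walk the probe sequence, starting at the key's home slot
        i = hash(key) % self.size
        for _ in range(self.size):
            yield i
            i = (i + 1) % self.size

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def lookup(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:             # never used: the key was never inserted
                raise KeyError(key)
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                raise KeyError(key)
            if slot is not DELETED and slot[0] == key:
                # a tombstone, not EMPTY, so later lookups keep probing past it
                self.slots[i] = DELETED
                return
        raise KeyError(key)
```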

Remonaremonetize answered 8/5, 2012 at 20:50 Comment(0)

While the previous answers are essentially correct, they don't directly address the random part of a universal hashing algorithm. Universal hashing algorithms do not use randomness when calculating a hash for a key. Random numbers are only used during the initialization of the hash table to choose a hash function from a family of hash functions. This prevents an adversary with access to the details of the hash function from devising a worst case set of keys.

In other words, during the lifetime of the hash table, the bucket for a given key is consistent. However, a different instance (such as next time the program runs) may place that same key in a different bucket.
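
A minimal sketch of that distinction, assuming integer keys and a ((a*x + b) mod p) mod m style family: the random draw happens exactly once, in the constructor, so a key's bucket is stable for the lifetime of that table instance, while a separate run of the program draws new parameters and may bucket the same key differently.

```python
import random

class UniversalHashTable:
    P = 2_147_483_647   # a large prime; assumes integer keys smaller than this

    def __init__(self, num_buckets=64):
        # the only randomness: choose (a, b) -- i.e. one function from the
        # family -- once, when the table is created
        self.m = num_buckets
        self.a = random.randrange(1, self.P)
        self.b = random.randrange(self.P)
        self.buckets = [[] for _ in range(self.m)]

    def _index(self, key):
        # deterministic for the lifetime of this instance
        return ((self.a * key + self.b) % self.P) % self.m

    def insert(self, key, value):
        self.buckets[self._index(key)].append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = UniversalHashTable()
t.insert(42, "John")
print(t.lookup(42))   # found by recomputing the same bucket index
# a second run of the program draws new (a, b), so key 42 may land elsewhere
```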

Repulsive answered 5/8, 2012 at 3:50 Comment(9)
Thank you! You made the key clarifying statement: "Random numbers are only used during the initialization of the hash table to choose a hash function from a family of hash functions".Hazard
I read the whole Cormen-Leiserson-Rivest-Stein Algorithms book chapter on this topic (universal hashing) and it still left me puzzled over how I could search for a key if I roll a new hash function every time I insert something. This answer put everything in place. Thanks!Disciplinant
This is the exact answer, nothing more nothing less.Ceciliacecilio
I mis-read the description of universal hashing as well. The Cormen/Leiserson book states "at the beginning of execution we select the hash function at random from a carefully designed class of functions." (my emphasis) and "the algorithm can behave differently on each execution, even for the same input" (pg 265). I had interpreted "execution" as the execution of the function to insert an item. It's now clear this means execution of the application or per instantiation of the hashing class.Unaneled
@cod3monk3y, I interpreted the text as the way you did and that's why I came here to find the answer, finally.Pompey
I still wanted to clarify one thing. Does it mean that "the" hash function chosen from a family of hash functions at the initialization stage of a hash table is used to hash keys for the lifetime of the hash table, right? If it is the case, I don't understand what is so random about this algorithm. It seems to me that after a hash function has been chosen, its "randomness" is equivalent to the division method.Pompey
@KinCheung I feel the same, Did you find the answer to this?Micaelamicah
A point to note: suppose you open source your software; then you also open source your hash function, and an adversary can exploit this to craft a data set that produces lots of collisions and hence bad hash table performance. With the universal hashing approach, you design a family of hash functions and choose one randomly at the initialization stage. Now an attacker cannot determine which hash function is actually in use at run time, because the choice of hash function is random.Faircloth
The hash handler may choose to pick a new function at any point, rehashing all elements with it and throwing away the old function afterwards. Thus, at any given moment there is only one active hash function, but it need not be the same throughout the life of the hash table. Good opportunities to make the switch are: unconditionally, when resizing due to the load factor; or if it detects the current function is producing too many collisions.Bryanbryana

A hash table is not organised randomly. It is organised by hash value. The table is searched by hash value to retrieve the correct hash group.

Starryeyed answered 3/5, 2012 at 11:20 Comment(2)
I realize that, but the hash value is random, isn't it? As in the value assigned to each piece of info? For example, if John is assigned a hash value of 2 and George is assigned a hash value of 5, if the table is asked to find George, wouldn't it have to search through the 5 bucket and the 2 bucket? The hash value assignment is random, so how could the table possibly know the hash value of the information that needs to be searched up?Hiddenite
The hash value is calculated. If you want to look for George, then you recalculate the hash, to get 5. Then you go straight to the 5 bucket. Essentially, you are using the hash value as an index into your hash table. A good hash method will scatter the inputs randomly across the available buckets, so all the buckets are about the same size. That does not mean that the hash values are assigned randomly. See isthe.com/chongo/tech/comp/fnv for an example of how a hash may be calculated, in this case the FNV hash.Starryeyed
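
For reference, here is a sketch of one FNV variant (32-bit FNV-1a) of the kind described on that page, showing that the hash is computed from the key rather than assigned at random; the bucket count is arbitrary:

```python
def fnv1a_32(data: bytes) -> int:
    # 32-bit FNV-1a: start from the offset basis, then fold in each byte
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF   # keep the value 32 bits wide
    return h

num_buckets = 8
for name in ("John", "George"):
    # recomputing the hash gives the same bucket every time, so a lookup
    # goes straight to that one bucket
    print(name, fnv1a_32(name.encode()) % num_buckets)
```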
