TL;DR: Hash tables guarantee O(1) expected worst case time if you pick your hash function uniformly at random from a universal family of hash functions. Expected worst case is not the same as average case.
Disclaimer: I don't formally prove hash tables are O(1); for that, have a look at this video from Coursera [1]. I also don't discuss the amortized aspects of hash tables; that is orthogonal to the discussion about hashing and collisions.
I see a surprising amount of confusion around this topic in other answers and comments, and will try to rectify some of it in this long answer.
Reasoning about worst case
There are different types of worst case analysis. The analysis that most answers have made here so far is not worst case analysis but rather average case analysis [2]. Average case analysis tends to be more practical: maybe your algorithm has one bad worst case input, but it actually works well for all other possible inputs. The bottom line is that your runtime depends on the dataset you're running on.
Consider the following pseudocode for the get method of a hash table. Here I'm assuming we handle collisions by chaining, so each entry of the table is a linked list of (key, value) pairs. We also assume the number of buckets m is fixed but is O(n), where n is the number of elements in the input.

    function get(a: Table with m buckets, k: Key being looked up)
        bucket <- compute hash(k) modulo m
        for each (key, value) in a[bucket]
            return value if k == key
        return not_found

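To make this concrete, here is a minimal runnable sketch of the same idea in Java. It is my own illustration, not code from any particular library: it assumes int keys and String values, a fixed bucket count m, and chains stored as linked lists.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    // Minimal chained hash table sketch: int keys, String values, m buckets.
    class ChainedTable {
        private static class Entry {
            final int key;
            final String value;
            Entry(int key, String value) { this.key = key; this.value = value; }
        }

        private final List<List<Entry>> buckets;
        private final int m;

        ChainedTable(int m) {
            this.m = m;
            this.buckets = new ArrayList<>(m);
            for (int i = 0; i < m; i++) buckets.add(new LinkedList<>());
        }

        // hash(k) modulo m; for int keys the "hash" is just the key itself.
        // Math.floorMod keeps the bucket index non-negative for negative keys.
        private int bucketOf(int key) {
            return Math.floorMod(key, m);
        }

        void put(int key, String value) {
            buckets.get(bucketOf(key)).add(new Entry(key, value));
        }

        // Walks the chain in the key's bucket: O(1) on average, O(n) if
        // everything collides into one bucket.
        String get(int key) {
            for (Entry e : buckets.get(bucketOf(key))) {
                if (e.key == key) return e.value;
            }
            return null; // not_found
        }
    }
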
As other answers have pointed out, this runs in O(1) on average and O(n) in the worst case. We can make a little sketch of a proof by challenge here. The challenge goes as follows:
(1) You give your hash table algorithm to an adversary.
(2) The adversary can study it and prepare as long as he wants.
(3) Finally, the adversary gives you an input of size n for you to insert in your table.
The question is: how fast is your hash table on the adversary's input?
From step (1) the adversary knows your hash function; during step (2) the adversary can craft a list of n elements with the same hash modulo m, e.g. by randomly computing the hash of a bunch of elements and keeping the ones that collide; and then in (3) they can give you that list. But lo and behold, since all n elements hash to the same bucket, your algorithm will take O(n) time to traverse the linked list in that bucket. No matter how many times we retry the challenge, the adversary always wins, and that's how bad your algorithm is: worst case O(n).
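To make step (2) concrete against the Java sketch above (where the hash of an int key is just the key and m is known), crafting a worst case input is trivial; the helper below is hypothetical, purely for illustration.

    // Adversary's step (2) against a fixed, known hash: n keys that all land in
    // bucket 0, because every key is congruent to 0 modulo m.
    // (Ignores int overflow of i * m for very large inputs.)
    static int[] craftCollisions(int n, int m) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i * m;
        }
        return keys;
    }

Insert those n keys and then call get on any one of them: the lookup has to walk a single chain of length n.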
How come hashing is O(1)?
What threw us off in the previous challenge was that the adversary knew our hash function very well, and could use that knowledge to craft the worst possible input.
What if, instead of always using one fixed hash function, we actually had a set of hash functions, H, that the algorithm can randomly choose from at runtime? In case you're curious, H is called a universal family of hash functions [3]. Alright, let's try adding some randomness to this.
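For reference, the property that makes H universal is roughly this (see [3] for the precise definition):

    for any two distinct keys x != y:
        Pr over hash drawn from H [ hash(x) mod m == hash(y) mod m ] <= 1/m

In other words, no fixed pair of keys collides under more than a 1/m fraction of the functions in the family.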
First suppose our hash table also includes a seed r, and r is assigned a random number at construction time. We assign it once and then it's fixed for that hash table instance. Now let's revisit our pseudocode.

    function get(a: Table with m buckets and seed r, k: Key being looked up)
        rHash <- H[r]
        bucket <- compute rHash(k) modulo m
        for each (key, value) in a[bucket]
            return value if k == key
        return not_found

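As a runnable counterpart, here is one way the seeded version could look in Java. Instead of indexing into an explicit array H by r, the seed is a random pair (a, b) drawn at construction time; it selects one member of the classic multiply-add family h(x) = ((a*x + b) mod p) mod m with p prime, which is a standard universal family. The names and the choice of family are mine, for illustration only; it also assumes keys are reduced into [0, p-1] before hashing.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Random;

    // Chained hash table whose hash function is chosen at random at construction
    // time from the multiply-add family h(x) = ((a*x + b) mod P) mod m.
    class SeededChainedTable {
        private static final long P = 2147483647L; // prime (2^31 - 1); keys are reduced into [0, P-1]

        private static class Entry {
            final int key;
            final String value;
            Entry(int key, String value) { this.key = key; this.value = value; }
        }

        private final List<List<Entry>> buckets;
        private final int m;
        private final long a, b; // the "seed": drawn once, private to this instance

        SeededChainedTable(int m) {
            this.m = m;
            Random rng = new Random();
            this.a = 1 + rng.nextInt((int) P - 1); // a in [1, P-1]
            this.b = rng.nextInt((int) P);         // b in [0, P-1]
            this.buckets = new ArrayList<>(m);
            for (int i = 0; i < m; i++) buckets.add(new LinkedList<>());
        }

        private int bucketOf(int key) {
            long x = Math.floorMod((long) key, P); // reduce the key into [0, P-1]
            long h = Math.floorMod(a * x + b, P);  // no overflow: a, x < 2^31, so a*x < 2^62
            return (int) (h % m);
        }

        void put(int key, String value) {
            buckets.get(bucketOf(key)).add(new Entry(key, value));
        }

        String get(int key) {
            for (Entry e : buckets.get(bucketOf(key))) {
                if (e.key == key) return e.value;
            }
            return null; // not_found
        }
    }

Two instances built from the same input will generally scatter the keys differently, which is exactly what stops the adversary from precomputing a list that is bad for every run.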
If we try the challenge one more time: from step (1) the adversary can know all the hash functions we have in H, but now the specific hash function we use depends on r. The value of r is private to our structure; the adversary cannot inspect it at runtime, nor predict it ahead of time, so he can't concoct a list that's always bad for us. Let's assume that in step (2) the adversary chooses one function hash in H at random, then crafts a list of n collisions under hash modulo m, and sends that for step (3), crossing his fingers that at runtime H[r] will be the same hash he chose.
This is a serious bet for the adversary: the list he crafted collides under hash, but will just be a random input under any other hash function in H. If he wins this bet, our run time will be worst case O(n) like before, but if he loses then we're just being given a random input, which takes the average O(1) time. And indeed most times the adversary will lose: he wins only once every |H| challenges, and we can make |H| very large.
Contrast this result with the previous algorithm, where the adversary always won the challenge. Handwaving here a bit, but since most times the adversary will fail, and this is true for all possible strategies the adversary can try, it follows that although the worst case is O(n), the expected worst case is in fact O(1).
Again, this is not a formal proof. The guarantee we get from this expected worst case analysis is that our run time is now independent of any specific input. This is a truly random guarantee, as opposed to the average case analysis where we showed a motivated adversary could easily craft bad inputs.