Hash Array Mapped Trie (HAMT)

I am trying to get my head around the details of a HAMT, so I've implemented one myself in Java just to understand it. I am familiar with tries and I think I get the main concept of the HAMT.

Basically, there are two types of nodes:

Key/Value

Key Value Node:
  K key
  V value

Index

Index Node:
  int bitmap (32 bits)
  Node[] table (max length of 32)

The lookup goes like this:

1. Generate a 32-bit hash for an object.
2. Step through the hash 5 bits at a time (bits 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31). Note: the last (7th) step uses only 2 bits.
3. At each step, check the bit at that 5-bit integer's position in the bitmap, e.g. integer==5 means checking bit 5, i.e. mask==100000.
  1. If the bit is a 1, that part of the hash exists.
  2. If the bit is a 0, the key doesn't exist.
4. If the key exists, find its index into the table by counting the number of 1s in the bitmap below that position, e.g. integer==6, bitmap==0101010101, index==3 (bits 0, 2 and 4 are set).
  1. If the table entry points to a key/value node, compare the keys.
  2. If the table entry points to an index node, go back to step 2, moving ahead one step.
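
Here is a minimal Java sketch of that lookup, just to make steps 3 and 4 concrete. The Node/KVNode/IndexNode names are my own illustration, not from the paper:

  // Minimal, hypothetical node classes.
  abstract class Node {}

  class KVNode extends Node {
    Object key;
    Object value;
  }

  class IndexNode extends Node {
    int bitmap;   // one bit per possible 5-bit chunk
    Node[] table; // compressed: only set bits have entries

    Object lookup(Object key, int hash, int level) {
      int chunk = (hash >>> (5 * level)) & 0x1F; // the next 5 bits (step 2)
      int bit = 1 << chunk;
      if ((bitmap & bit) == 0) return null;      // step 3.2: key doesn't exist
      // Step 4: index = number of 1s in the bitmap below our position.
      int index = Integer.bitCount(bitmap & (bit - 1));
      Node child = table[index];
      if (child instanceof KVNode) {             // step 4.1: compare keys
        KVNode kv = (KVNode) child;
        return kv.key.equals(key) ? kv.value : null;
      }
      // Step 4.2: descend one level, consuming 5 more bits of the hash.
      return ((IndexNode) child).lookup(key, hash, level + 1);
    }
  }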

The part I don't quite understand is collision detection and mitigation. In the linked paper he alludes to it:

The existing key is then inserted in the new sub-hash table and the new key added. Each time 5 more bits of the hash are used the probability of a collision reduces by a factor of 1/32. Occasionally an entire 32 bit hash may be consumed and a new one must be computed to differentiate the two keys.

If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed" hash?

I must be missing something.....

BTW: The HAMT performs fairly well; it sits between a hash map and a tree map in my tests.

Data Structure                    Add time   Remove time     Sorted add time Sorted remove time   Lookup time     Size     
Java's Hash Map                   38.67 ms   18 ms           30 ms           15 ms                16.33 ms        8.19 MB        
HAMT                              68.67 ms   30 ms           57.33 ms        35.33 ms             33 ms           10.18 MB       
Java's Tree Map                   86.33 ms   78 ms           75 ms           39.33 ms             76 ms           8.79 MB        
Java's Skip List Map              111.33 ms  106 ms          72 ms           41 ms                72.33 ms        9.19 MB     
Paderna answered 31/10, 2013 at 19:13 Comment(0)

HAMT is a great and highly performant structure, especially when one needs immutable objects, i.e. after any modification a new copy of the data structure is created!
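
To make that concrete: an update copies only the nodes on the path from the root to the change, and everything else is shared between the old and new versions. A rough Java sketch, assuming an IndexNode with a bitmap and table as in the question:

  // Hypothetical path-copying update: the old node is untouched, and
  // all unchanged children are shared between old and new versions.
  static IndexNode withChild(IndexNode node, int index, Node newChild) {
    IndexNode copy = new IndexNode();
    copy.bitmap = node.bitmap;
    copy.table = node.table.clone(); // shallow copy of this one table only
    copy.table[index] = newChild;    // swap in the updated child
    return copy;
  }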

As for your question on hash collisions, I have found a C# implementation (which is currently buggy) that shows how it works: on each hash collision the key is rehashed and the lookup is retried recursively until a maximum iteration limit is reached.

Currently I am also exploring HAMT in a functional programming context and learning from existing code. There are several reference implementations of HAMT, in Haskell as Data.HashMap and in Clojure as PersistentHashMap.

There are some other, simpler implementations on the web that do not deal with collisions, but they are useful for understanding the concept. Here they are in Haskell and OCaml.

I have found a nice summary article that describes HAMT with pictures and links to the original research papers by Phil Bagwell.

Related points:

While implementing HAMT in F# I noticed that the popCount function implementation described here really matters: it gives a 10-15% speedup compared to the naive implementation described in the next answers at that link. Not great, but a free lunch.
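
For reference, Java's equivalent is Integer.bitCount (usually JIT-compiled to a single POPCNT instruction); a standard branch-free SWAR formulation of a 32-bit popCount looks like this:

  // Classic 32-bit SWAR population count.
  static int popCount(int x) {
    x = x - ((x >>> 1) & 0x55555555);                // pairwise bit sums
    x = (x & 0x33333333) + ((x >>> 2) & 0x33333333); // 4-bit sums
    x = (x + (x >>> 4)) & 0x0F0F0F0F;                // 8-bit sums
    return (x * 0x01010101) >>> 24;                  // add up the 4 bytes
  }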

The IntMap structures (in Haskell, and its port to F#) are very good when the key can be an integer; they implement the related PATRICIA/radix trie.

I believe all these implementations are very good for learning efficient immutable data structures and functional languages in all their beauty; they really shine together!

Kistler answered 7/11, 2013 at 9:27 Comment(3)
Thanks! Sad that these kinds of answers get a few votes while others (very simple ones) get hundreds LOL. Do you know of any Java-based implementation?Keefer
The one from Clojure would probably work from Java. But I do not use the JVM regularly and do not know its ecosystem well.Kistler
Reading the article on HAMT I don’t see how it’d map sub-words. Let’s say both (0101; 010101) are keys. The 1st one is also a prefix of the path to the 2nd key but is not flagged as a key.Keefer

There are two sections of the paper I think you might have missed. The first is the bit immediately preceding the part you quoted:

Or the key will collide with an existing one. In which case the existing key must be replaced with a sub-hash table and the next 5 bit hash of the existing key computed. If there is still a collision then this process is repeated until no collision occurs.

So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed will be a pointer to another subtable (where they don't clash).
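
In Java, that push-down might look like the following sketch (my own hedged illustration, reusing the hypothetical node classes from the question):

  // Replace a clashing leaf with a sub-table; recurse while the next
  // 5-bit chunks still clash. (If all 32 bits get consumed, the hash
  // must be extended; see section 3.7 below.)
  static Node resolveClash(KVNode existing, int hashA,
                           KVNode added, int hashB, int level) {
    int chunkA = (hashA >>> (5 * level)) & 0x1F;
    int chunkB = (hashB >>> (5 * level)) & 0x1F;
    IndexNode sub = new IndexNode();
    if (chunkA == chunkB) {
      // Still clashing: one slot pointing at a deeper sub-table.
      sub.bitmap = 1 << chunkA;
      sub.table = new Node[] {
        resolveClash(existing, hashA, added, hashB, level + 1)
      };
    } else {
      // Chunks differ: both leaves fit in this sub-table, in bit order.
      sub.bitmap = (1 << chunkA) | (1 << chunkB);
      sub.table = chunkA < chunkB
          ? new Node[] { existing, added }
          : new Node[] { added, existing };
    }
    return sub;
  }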

Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:

The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.
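
In Java, that level-seeded rehash might look like the following sketch. This is my own hedged illustration (the mixing constants are MurmurHash3's finalizer); note, as the comment thread below works out, that if key.hashCode() is your only input, two distinct keys with equal hash codes remain inseparable at every level:

  // Derive the n-th 32-bit hash for a key by mixing the trie level
  // into hashCode(). Caveat: this cannot separate distinct keys whose
  // hashCode() values are equal; see the comment discussion below.
  static int hash(Object key, int level) {
    int h = key.hashCode() + level * 0x9E3779B9; // fold in the level
    h ^= h >>> 16;                               // MurmurHash3 finalizer
    h *= 0x85EBCA6B;
    h ^= h >>> 13;
    h *= 0xC2B2AE35;
    h ^= h >>> 16;
    return h;
  }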

If this doesn't seem to explain anything, say so and I'll extend this answer with more detail.

Majewski answered 23/11, 2013 at 9:31 Comment(13)
Thanks for this; I am still trying to comprehend it at the moment. I'll get back to you.Paderna
Section 3.7 assumes the hash is a function of multiple inputs with the trie level as some sort of "seed", but in languages like Java and C# the hash code is often a primitive method on the object and thus has only a single input (the object). I can't see any way to make section 3.7 apply in such languages, which means the best you can do is just store a collision list at the leaf and search it linearly.Lacewing
Wrap objects being put into the data structure in a Node class, then insert the Node rather than the wrapped object into the HAMT. You can then define the necessary multi-arg hash function on the Node class.Majewski
The hash generated from the underlying object is still the same, which means any hash generated from the Node would also be the same, unless some non-deterministic value is included, like the Node address. But you can't possibly know the Node address when searching for your key, so you can insert to infinite depth, but you can no longer perform lookups. I don't see how your suggestion solves the problem.Lacewing
So Node's customHash function takes the primitive hash of the object the Node contains, combines it with the integer representing the tree level (passed as an argument), and returns the hash of the result. When you want to test whether an object is in the HAMT, you stick it in a Node object and use customHash to find its location in the level. At no point are you referencing the Node's primitive hash.Majewski
Let's consider a specific example: we have two keys, k1 and k2 where k1!=k2, and their colliding hash code is 42. Show me any customHash that can take 42 and the trie level and deterministically produce a different output for k1 and k2.Lacewing
Aha, sorry, I see your issue now. So the problem isn't that the hashcode can't take multiple inputs, it's that it can't be extended up to uniqueness. I don't see how this is a problem particular to managed languages like Java or C# though?Majewski
The only solution I can see in any language is that every type inserted into the HAMT has to have a custom function that'll return a hash that's extensible up to equivalence-class uniqueness.Majewski
Yes, uniqueness is the goal, I'm just saying the only way to achieve it is to have the primitive hash code function be a function of multiple inputs. But this isn't possible in Java/C# unless you write some custom, generic hash function that uses reflection, or as you say above, by requiring each type to implement a custom hash code function.Lacewing
Which languages do have such a suitable primitive hashcode?Majewski
None have it primitively, but you can write one generically in C/C++ and Ada since they have pointers. You can also require such a hash function in Haskell since it supports ad-hoc overloading via type classes. The only way to do it in an ad-hoc fashion in Java/C# is via reflection, or just give up and do what I did with my C# HAMT and store collisions in a leaf that's searched linearly.Lacewing
As in, you use the address of the object as your hashcode? That seems a little dangerous since it's not constant over the life of the object. You could cache it on creation of course, but that leaves you no better off than defining your own hash function. And can't you mimic the Haskell approach in Java/C# using subclassing and constrained type parameters? Wouldn't work on final classes I suppose.Majewski
Not the address for the hash, but the function would look like "int hash(void* obj, int nbytes, int seed)". So the hash of the contents of the object pointed to. As for Java/C#, requiring clients of your HAMT to subclass doesn't seem reasonable. You can mimic the Haskell approach by passing in an explicit delegate (C#)/anonymous class (Java) as the hash function. Still kind of annoying, and collisions shouldn't be so common that linear search is a problem except for huge trees. Edit: and the passed in hash function needs to actually use the seed correctly for the hash to work properly.Lacewing
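
For completeness, the delegate idea from the last comment maps to Java as a small functional interface supplied to the HAMT at construction. The names here are illustrative, not from any library:

  // A user-supplied seeded hash: the HAMT passes the trie "generation"
  // as the seed, and the implementation must genuinely depend on it.
  @FunctionalInterface
  interface SeededHasher<K> {
    int hash(K key, int seed);
  }

  // Example for byte-array keys: FNV-1a with the seed folded into the
  // offset basis, so different seeds yield independent hashes.
  SeededHasher<byte[]> bytesHasher = (key, seed) -> {
    int h = 0x811C9DC5 ^ seed; // FNV-1a offset basis, perturbed by seed
    for (byte b : key) {
      h ^= (b & 0xFF);
      h *= 0x01000193;         // FNV prime
    }
    return h;
  };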

If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed" hash?

When doing a look-up the initial hash is used. When the bits in the initial hash are exhausted, one of the following conditions is true:

  1. we end up with a key/value node - return it
  2. we end up with an index node - this is the hint that we have to go deeper by recomputing a new hash.

The key here is hash-bit exhaustion.
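
Look-ups stay deterministic because each replacement hash is derived from the key and the depth, so it can always be regenerated. A small Java sketch of the bookkeeping, reusing the hypothetical hash(key, level) from the answer above:

  // With 5 bits per level, one 32-bit hash covers 7 levels
  // (six 5-bit chunks plus a final 2-bit chunk).
  static final int LEVELS_PER_HASH = 7;

  static int chunkAt(Object key, int level) {
    int generation = level / LEVELS_PER_HASH; // 0 = the initial hash
    int within = level % LEVELS_PER_HASH;
    int h = hash(key, generation);            // regenerated on every look-up
    return (h >>> (5 * within)) & 0x1F;
  }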

Witchy answered 25/7, 2014 at 7:32 Comment(0)

The chance of collision is presumably very low, and generally only problematic for huge trees. Given this, you're better off just storing collisions in an array at the leaf and searching it linearly (I do this in my C# HAMT).
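
A minimal Java sketch of that fallback (a hypothetical CollisionNode alongside the node types sketched in the question): entries whose full hashes collide live in one flat array and are searched linearly.

  // Leaf for full 32-bit hash collisions: every entry here has the
  // same hash, so equality checks are the only way to tell them apart.
  class CollisionNode extends Node {
    Object[] keys;
    Object[] values;

    Object lookup(Object key) {
      for (int i = 0; i < keys.length; i++) { // linear scan
        if (keys[i].equals(key)) return values[i];
      }
      return null;
    }
  }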

Lacewing answered 20/4, 2014 at 14:40 Comment(1)
While this doesn't stick to the strict definition of a HAMT, it may be the best approach; I'll measure the extra memory costs to carry around another array or linked-list. I'll let you know.Paderna
