Parallel algorithm for constructing a trie?
Asked Answered

4

9

Because the trie data structure has such a huge branching factor and each subtree is totally independent of the others, it seems like there ought to be a way to enormously speed up trie construction for a given dictionary by adding all the words in parallel.

My initial idea on how to do this was the following: associate a mutex with each pointer in the trie (including the pointer to the root) and then have each thread follow the normal algorithm for inserting a word into the trie. However, before following any pointers, a thread must first acquire the lock on that pointer so that if it needs to add a new child node to the trie, it can do so without introducing any data races.
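
In code, the idea would look roughly like this (just a sketch, assuming a lowercase a-z alphabet and that the root node already exists; the end-of-word flag is made atomic so it can be set without its own lock):

#include <atomic>
#include <mutex>
#include <string>

struct Node {
    Node* child[26] = {};
    std::mutex child_lock[26];           // one lock guarding each outgoing pointer
    std::atomic<bool> is_word{false};
};

void insert(Node* root, const std::string& word) {
    Node* cur = root;
    for (char ch : word) {
        int c = ch - 'a';
        std::lock_guard<std::mutex> guard(cur->child_lock[c]);  // acquire before following
        if (cur->child[c] == nullptr)
            cur->child[c] = new Node();  // safe: we hold the lock on this pointer
        cur = cur->child[c];
    }                                    // each lock is released at the end of its iteration
    cur->is_word.store(true);
}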

The catch with this approach is that it uses an enormous number of locks - one for each pointer in the trie - and does an enormous number of acquires and releases - one for each character in each input string.

Is there a way to build a trie in parallel without using nearly as many locks?

Southerland answered 15/1, 2013 at 16:42 Comment(0)
10

An obvious lock-free algorithm would be:

  1. Bucket-sort the input strings by a length-k prefix (ordinarily, k=1, but with small alphabets, increase k).
  2. For each bucket, construct a trie containing the k-suffixes of all the strings in that bucket.
  3. Merge the tries from the previous step (when k=1, just add a root node).

Assuming a uniform distribution of prefixes, this can give you a linear speedup up to the size of the alphabet to the power k.
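
A rough sketch of step 1 in C++ (the names and the use of std::unordered_map are illustrative, not part of the algorithm):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Group the words by their length-k prefix; each bucket keeps only the remaining suffixes.
std::unordered_map<std::string, std::vector<std::string>>
bucket_by_prefix(const std::vector<std::string>& words, std::size_t k) {
    std::unordered_map<std::string, std::vector<std::string>> buckets;
    for (const std::string& w : words) {
        std::string prefix = w.substr(0, k);                 // words shorter than k form their own buckets
        buckets[prefix].push_back(w.substr(prefix.size()));  // the k-suffix (possibly empty)
    }
    return buckets;
}

// Steps 2-3: hand each bucket to its own worker, which builds a trie of the suffixes;
// then graft each finished trie onto the node reached by spelling out its prefix.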

Substance answered 15/1, 2013 at 17:9 Comment(1)
I think a non-uniform distribution of prefixes would also be fine if you create enough keys and have your workers take them from a queue. Some jobs will be larger than others, but with enough of them, almost all will be processed in parallel before the queue is empty. - Madrepore
4

It just occurred to me that this can be made lock-free by using atomic test-and-set operations on the pointers instead of locks. Specifically, if a thread wants to follow a pointer, it does the following:

  1. Atomically read the pointer value.
  2. If the pointer is non-null, follow it. You're done.
  3. Otherwise, allocate a new node.
  4. Atomically test the pointer for null and set it to the new node if it is null.
  5. (Remark: The pointer is definitely non-null here. Either we just set it, or it was set by another thread).
  6. Follow the pointer.

Depending on the hardware, this could be much faster, since it avoids the effort of locking and unlocking all the time and ensures that no thread is ever waiting indefinitely.
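
In C++-ish terms, steps 1-6 could look like the sketch below, assuming each child slot is a std::atomic<Node*> (a compare-and-swap plays the role of the test-and-set):

#include <array>
#include <atomic>

struct Node {
    std::array<std::atomic<Node*>, 26> child{};   // all children start out null
    std::atomic<bool> is_word{false};
};

// Follow (or create) the child for letter index c using a single compare-and-swap.
Node* get_or_create(Node* node, int c) {
    Node* next = node->child[c].load();                            // step 1
    if (next != nullptr) return next;                              // step 2
    Node* fresh = new Node();                                      // step 3
    Node* expected = nullptr;
    if (node->child[c].compare_exchange_strong(expected, fresh))   // step 4
        return fresh;                                              // we installed our node
    delete fresh;                 // another thread won the race; its node is in `expected`
    return expected;                                               // steps 5-6
}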

One downside is that the number of allocations involved goes up, since multiple threads might all try allocating a node to put in the trie at a certain spot, but only one can put it there. Fortunately, this can be mitigated by the following optimization: if a thread ever allocates a node unnecessarily, rather than immediately freeing it, it just stores the node in temporary space. If later on it needs to allocate a new node, it can use the cached one. If not, it can free it at the very end.
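
The caching optimization can be layered on top, for example with a thread-local spare node (again just a sketch, reusing the Node type from the previous snippet):

// A spare node per thread absorbs the extra allocations from lost races.
thread_local Node* spare = nullptr;

Node* get_or_create_cached(Node* node, int c) {
    Node* next = node->child[c].load();
    if (next != nullptr) return next;
    if (spare == nullptr) spare = new Node();       // allocate only when the cache is empty
    Node* expected = nullptr;
    if (node->child[c].compare_exchange_strong(expected, spare)) {
        Node* installed = spare;
        spare = nullptr;                            // the trie now owns this node
        return installed;
    }
    return expected;                                // lost the race; keep `spare` for next time
}

// When a thread finishes all of its insertions, it frees its leftover spare node, if any.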

Hope this helps!

Southerland answered 15/1, 2013 at 17:8 Comment(2)
Your fix for the overallocation problem seems to be equivalent to using a pool allocator. - Substance
@larsmans - I was hoping for it to be a "lock free pool" where each thread keeps its own nodes separately from every other thread. I thought that this might reduce locking and unlocking, though perhaps I'm reinventing the wheel and this has already been figured out. - Southerland
1

Well, there is the obvious trade-off between fine vs. coarse lock granularity - setting one lock for a whole set of nodes rather than one per node.

A simple way to do it is via hashing - have m different locks, and for each node you want to access acquire the lock numbered hash(node) % m.
Note that this approach is basically a generalization of the suggested approach (with perfect hashing and n == m), and of the serial approach (with m == 1).
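
For example (a sketch; the choice of m and of hashing the node's address are arbitrary):

#include <cstddef>
#include <functional>
#include <mutex>

constexpr std::size_t m = 64;          // number of lock "stripes"; m = 1 degenerates to serial
std::mutex stripes[m];

std::mutex& lock_for(const void* node) {
    return stripes[std::hash<const void*>{}(node) % m];
}

// Before touching a node's child pointers:
//   std::lock_guard<std::mutex> guard(lock_for(current_node));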

Another thing that might be utilized is an optimistic design - whether it will actually increase performance depends on the distribution of the dictionary and the size of the trie, of course, and it might help a lot if collisions tend to be very rare (which might be the case for a dictionary of very long words).
The idea is to just add the words to the trie without any synchronization, and if you encounter a collision - roll back to the last known stable state (this of course requires taking a snapshot of the data, and might be infeasible if we are speaking about streams of data that cannot be stored).

Scraper answered 15/1, 2013 at 16:55 Comment(2)
How would there be deadlock involved? No thread ever needs to hold more than one lock at a time, unless I'm mistaken. - Southerland
@templatetypedef: Actually - because of the nature of the trie - I think you are correct; I had some other protocol in mind (related to file systems), where in order to ensure no data races one must hold the locks on the entire path from a certain node down to the leaf at each point. However, that is not the case here. I'll edit this part out. - Scraper
1

Depending on what your dictionary looks like, you might not need the locks at all, if you can get each thread to build independent subtrees. If this isn't an online algorithm, presort the words by prefix (by first letter if you have < 26 threads, by first and second letter if you have more or you know the data isn't balanced, e.g. 90% of the words start with A). Basically, this would be an O(n) operation: one pass to count how many words start with a given letter, then one pass to sort (along the lines of a radix sort by your chosen prefix). Then divide the prefixes among the threads and have each thread build these independent subtrees. Finally, have one thread add each of these subtrees to the root. I'll walk through an example below, with a code sketch after it.

Your dictionary:
Bark
Apple
Cookie
And
Baby
Corn
Blue
Cake
Bacon

After sorting:
Apple
And
Bark
Baby
Blue
Bacon
Corn
Cookie
Cake

Then we divide prefixes among the threads. For this example we have 3 threads, which get the prefixes [A][B][C], and build the following trees:

A --|         B -------|           C -------|    
P   N     |-- A ---|   L           O ---|   A
P   D     R   B    C   U           O    R   K
L         K   Y    O   E           K    N   E
E                  N               I
                                   E 

And then you have one thread combine these at the root like:

|----------- Root------------------|
A --|         B -------|           C -------|    
P   N     |-- A ---|   L           O ---|   A
P   D     R   B    C   U           O    R   K
L         K   Y    O   E           K    N   E
E                  N               I
                                   E 

I hope that made sense.
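
Here is a rough sketch of the whole approach (names are illustrative; it buckets by first letter only and assumes lowercase a-z words):

#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct Node {
    Node* child[26] = {};
    bool is_word = false;
};

// Plain single-threaded insert, starting from position `from` of the word.
void insert_suffix(Node* root, const std::string& w, std::size_t from) {
    Node* cur = root;
    for (std::size_t i = from; i < w.size(); ++i) {
        int c = w[i] - 'a';
        if (cur->child[c] == nullptr) cur->child[c] = new Node();
        cur = cur->child[c];
    }
    cur->is_word = true;
}

Node* build_parallel(const std::vector<std::string>& words) {
    // Pass 1: bucket the words by first letter (the "sort" step).
    std::vector<std::vector<std::string>> buckets(26);
    for (const std::string& w : words)
        if (!w.empty()) buckets[w[0] - 'a'].push_back(w);

    // Pass 2: each thread builds one independent subtree, so no locks are needed.
    Node* root = new Node();
    std::vector<std::thread> workers;
    for (int c = 0; c < 26; ++c) {
        if (buckets[c].empty()) continue;
        root->child[c] = new Node();                  // attach the subtree root up front
        workers.emplace_back([&buckets, root, c] {
            for (const std::string& w : buckets[c])
                insert_suffix(root->child[c], w, 1);  // skip the first letter
        });
    }
    for (std::thread& t : workers) t.join();          // the single combining step
    return root;
}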

Advantages to this method: The threads work essentially independently, and you don't have the overhead of having to deal with acquiring and releasing locks.

Drawbacks to this method: If you don't know anything about the dictionary, there may be a severe workload imbalance, and in the worst case (say, all the words start with 'A') it reverts to basically being a single thread building the tree. There are a few ways to mitigate this; for example, you can add a check during the sort so that if the workload is severely imbalanced under a single-letter prefix, you re-sort by the first two letters, but you still can't guarantee that it will be balanced.

You may also have idle threads at times: if, say, you have 20 threads and sort by first letter, then 6 of the threads will each have to build two subtrees while the other 14 sit idle for half the time. You may be able to further subdivide the subtrees to deal with this, but that's extra time spent on a preprocessing step.

Anyway, there are no guarantees that this is faster than your method, but it's something to consider.

Millwork answered 15/1, 2013 at 17:31 Comment(2)
This is a good answer, but isn't this just what @larsmans proposed? - Southerland
Indeed it is. I started typing it when there were no answers, and by the time I posted it, @larsmans had submitted his answer, so I didn't see it. That's what I get for being verbose. - Millwork
