Prefix search in a radix tree/patricia trie

Asked 27/4, 2009 at 18:9 Answered 18/6, 2013 at 20:54

Solved c++ algorithm prefix patricia-trie

I'm currently implementing a radix tree/patricia trie (whatever you want to call it). I want to use it for prefix searches in a dictionary on a severely underpowered piece of hardware. It's supposed to work more or less like auto-completion, i.e. showing a list of words that start with the typed prefix.

My implementation is based on this article, but the code therein doesn't include prefix searches, though the author says:

[...] Say you want to enumerate all the nodes that have keys with a common prefix "AB". You can perform a depth first search starting at that root, stopping whenever you encounter back edges.

But I don't see how that is supposed to work. For example, if I build a radix tree from these words:

illness
imaginary
imagination
imagine
imitation
immediate
immediately
immense
in

I will get the exact same "best match" for the prefixes "i" and "in" so that it seems difficult to me to gather all matching words just by traversing the tree from that best match.

Additionally, there is a radix tree implementation in Java that has an implemented prefix search in RadixTreeImpl.java. That code explicitly checks all nodes (starting from a certain node) for a prefix match - it actually compares bytes.

Can anyone point me to a detailed description on implementing a prefix search on radix trees? Is the algorithm used in the Java implementation the only way to do it?

Hematite answered 27/4, 2009 at 18:9 Comment(0)

Think about what your trie encodes. At each node, you have the path that led you to that node, so in your example, you start at Λ (that's a capital Lambda, this Greek font kind of sucks), the root node corresponding to an empty string. Λ has children for each letter used, so in your data set, you have one branch, for "i".

Λ
Λ→"i"

At the "i" node, there are two children, one for "m" and one for "n". The next letter is "n", so you take that,

Λ→"i"→"n"

and since the only word that starts "i","n" in your data set is "in", there are no children from "n". That's a match.

Now, let's say the data set, instead of having "in", had "infindibulum". (What SF I'm referencing is left as an exercise.) You'd still get to the "n" node the same way, but then if the next letter you get is "q", you know the word doesn't appear in your data set at all, because there's no "q" branch. At that point, you say "okay, no match." (Maybe you then start adding the word, maybe not, depending on the application.)

But if the next letter is "f", you can keep going. You can short circuit that with a little craft, though: once you reach a node that represents a unique path, you can hang the whole string off that node. When you get to that node, you know that the rest of the string must be "findibulum", so you've used the prefix to match the whole string, and return it.

How would you use that? In a lot of non-UNIX command interpreters, like the old VAX DCL, you could use any unique prefix of a command. So, the equivalent of ls(1) was DIRECTORY, but no other command started with DIR, so you could type DIR and that was as good as typing the whole word. If you couldn't remember the correct command, you could type just 'D' and hit (I think) ESC; the DCL CLI would return all the commands that started with D, which it could search extremely fast.
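To make the walk above concrete, here is a minimal sketch of an uncompressed trie with a prefix lookup. It is only an illustration: the names `TrieNode`, `trie_insert`, `trie_collect` and `trie_complete` are made up for this example, and a `std::map` per node is an assumption chosen for brevity, not for a memory-constrained target.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type: one child per possible next character.
struct TrieNode {
    bool is_word = false;  // true if the path from the root to here is a stored word
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void trie_insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());  // create the branch on first use
        node = child.get();
    }
    node->is_word = true;
}

// Depth-first walk: appends every word stored at or below `node`.
// `path` holds the characters on the way from the root to `node`.
void trie_collect(const TrieNode& node, std::string& path,
                  std::vector<std::string>& out) {
    if (node.is_word) out.push_back(path);
    for (const auto& kv : node.children) {
        path.push_back(kv.first);
        trie_collect(*kv.second, path, out);
        path.pop_back();
    }
}

// Follow the prefix letter by letter; a missing branch means "no match".
// Otherwise every word below the node we end on is a completion.
std::vector<std::string> trie_complete(const TrieNode& root,
                                       const std::string& prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return {};
        node = it->second.get();
    }
    std::vector<std::string> out;
    std::string path = prefix;
    trie_collect(*node, path, out);
    return out;
}
```

With the word list from the question inserted, `trie_complete(root, "in")` yields only "in", while `trie_complete(root, "i")` yields all nine words, because the two prefixes end on different nodes.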

Schuck answered 27/4, 2009 at 18:26 Comment(7)

It's quite possible that my understanding of patricia tries is flawed, but I believe that such tries store - in contrast to normal tries - complete words and always have two child nodes. In my example, I have a root node which has a self reference and "illness" as a child. "illness" in turn has "in" as a child and also a self reference. "imaginary" is a child of "in" and has a reference to "illness" as well as to "imitation". And so on... – Hematite 27/4, 2009 at 18:55

You're right that what I wrote talks about a trie in general; PATRICIA tries use less space, but are more complicated. The basic idea is the same though. Here's another nice page: csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA – Schuck 27/4, 2009 at 19:7

Basically, a PATRICIA compresses out many of the nodes by storing strings for long chains of single nodes. – Schuck 27/4, 2009 at 19:8

Which means that I have to compare the string's characters with the prefix I search for, right? – Hematite 27/4, 2009 at 19:23

Well, you're pretty much going to have to no matter what. I'm not sure I understand the question. – Schuck 27/4, 2009 at 19:39

I was hoping I could skip comparing individual characters after finding the "best match" as no such thing is explicitly mentioned in the article at CodeProject I cited and linked to. Now that I read it again I begin to think by "perform a depth first search" the author means one has to compare the string to the prefix. – Hematite 27/4, 2009 at 19:56

Yeah, you'll certainly have to complete the search at least as far as the shortest unique prefix. – Schuck 27/4, 2009 at 20:57
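A rough sketch of what the comments above are getting at, for a compressed (edge-labelled) radix tree: the prefix can run out in the middle of an edge label, so the lookup has to compare characters against each label on the way down. This is not the layout from the CodeProject article or the Java implementation; `RadixNode`, `common_len`, `radix_insert`, `radix_collect` and `radix_complete` are hypothetical names for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical compressed node: `label` is the text on the edge leading into it.
struct RadixNode {
    std::string label;
    bool is_word = false;
    std::map<char, std::unique_ptr<RadixNode>> children;  // keyed by label[0]
};

// Number of leading characters of `label` that match `s` starting at `s_pos`.
static std::size_t common_len(const std::string& label, const std::string& s,
                              std::size_t s_pos) {
    std::size_t n = 0;
    while (n < label.size() && s_pos + n < s.size() && label[n] == s[s_pos + n]) ++n;
    return n;
}

void radix_insert(RadixNode& root, const std::string& word) {
    RadixNode* node = &root;
    std::size_t pos = 0;
    while (pos < word.size()) {
        std::unique_ptr<RadixNode>& slot = node->children[word[pos]];
        if (!slot) {  // no edge starts with this character: hang the rest here
            slot.reset(new RadixNode());
            slot->label = word.substr(pos);
            slot->is_word = true;
            return;
        }
        const std::size_t n = common_len(slot->label, word, pos);
        if (n < slot->label.size()) {  // word leaves the edge part-way: split it
            std::unique_ptr<RadixNode> upper(new RadixNode());
            upper->label = slot->label.substr(0, n);
            slot->label.erase(0, n);
            const char key = slot->label[0];
            upper->children[key] = std::move(slot);
            slot = std::move(upper);
        }
        node = slot.get();
        pos += n;
    }
    node->is_word = true;
}

// Depth-first walk collecting every word at or below `node`.
void radix_collect(const RadixNode& node, std::string& path,
                   std::vector<std::string>& out) {
    if (node.is_word) out.push_back(path);
    for (const auto& kv : node.children) {
        const std::size_t old_size = path.size();
        path += kv.second->label;
        radix_collect(*kv.second, path, out);
        path.resize(old_size);
    }
}

std::vector<std::string> radix_complete(const RadixNode& root,
                                        const std::string& prefix) {
    const RadixNode* node = &root;
    std::string path;
    std::size_t pos = 0;
    while (pos < prefix.size()) {
        auto it = node->children.find(prefix[pos]);
        if (it == node->children.end()) return {};
        const std::string& label = it->second->label;
        const std::size_t n = common_len(label, prefix, pos);
        // A mismatch before either string ran out means nothing has this prefix.
        if (n < label.size() && pos + n < prefix.size()) return {};
        path += label;  // the prefix may end mid-edge; the whole edge still applies
        node = it->second.get();
        pos += n;
    }
    std::vector<std::string> out;
    radix_collect(*node, path, out);
    return out;
}
```

The character comparison happens only along the single path that the prefix selects; once the prefix is used up, everything below that node matches and can be enumerated without further comparisons.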

It turns out the GNU extensions for the standard C++ library include a Patricia trie implementation. It's found under the policy-based data structures extension. See http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/trie_based_containers.html
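For reference, a prefix lookup with that container might look roughly like the sketch below. It assumes GCC with libstdc++ (the headers are not part of standard C++), and `WordTrie` and `pbds_complete` are names invented for this example. As far as I know, older GCC releases spell `null_type` as `null_mapped_type`, so this may need adjusting for your compiler.

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>
#include <string>
#include <vector>

// PATRICIA trie used as a set of strings. The last template argument is the
// node-update policy that provides prefix_range().
typedef __gnu_pbds::trie<
    std::string,
    __gnu_pbds::null_type,                    // no mapped value: set-like container
    __gnu_pbds::trie_string_access_traits<>,  // default traits for std::string keys
    __gnu_pbds::pat_trie_tag,
    __gnu_pbds::trie_prefix_search_node_update>
    WordTrie;

// Copies every stored key that starts with `prefix` into a vector.
std::vector<std::string> pbds_complete(const WordTrie& words,
                                       const std::string& prefix) {
    std::vector<std::string> out;
    // prefix_range returns a [first, second) pair of iterators over the matches.
    auto range = words.prefix_range(prefix);
    for (auto it = range.first; it != range.second; ++it) out.push_back(*it);
    return out;
}
```

Words are added with `words.insert("imagine")`, as with other set-like containers.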

Increscent answered 2/3, 2010 at 16:14 Comment(1)

What a find; I never imagined gcc has such implementations in its ext namespace! Please note that it is available only in the later versions of the gcc 4.x series. – Tempa 1/10, 2013 at 17:51

An alternative algorithm: Keep It Simple Stupid!

Just make a sorted list of your keywords. When you have a prefix, binary search to find where that prefix would be located in the list. All of your possible completions will be found starting at that index, ready to be accessed in place.

This algorithm will require only 5% of the code of a Patricia trie and will be easy to maintain, understand, and update. It is almost certain this simple list search will be more efficient as well.

The only downside is if you have huge numbers of long keywords with similar prefixes, a trie can save some storage since it doesn't need to keep the full prefix for every entry. In practice, if you have less than a few million words, this is not a savings because the pointer overhead of the tree will dominate. This savings is more for applications like searching databases of DNA strings with millions of characters, not text keywords.
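A minimal sketch of the sorted-list approach, assuming the keywords are already stored in a `std::vector<std::string>` sorted with the default `operator<`. The function name `sorted_complete` is made up for this example, and copying the matches into a new vector is only for demonstration; on constrained hardware you would more likely keep the iterator range and read the words in place.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// `sorted_words` must already be sorted in ascending order.
std::vector<std::string> sorted_complete(const std::vector<std::string>& sorted_words,
                                         const std::string& prefix) {
    std::vector<std::string> out;
    // Binary search for the first word that is not less than the prefix.
    auto it = std::lower_bound(sorted_words.begin(), sorted_words.end(), prefix);
    // All words sharing the prefix sit in one contiguous run starting there.
    for (; it != sorted_words.end() &&
           it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(*it);
    }
    return out;
}
```

The lookup costs one binary search plus one string comparison per reported match (and one more for the first non-match).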

Compulsive answered 28/4, 2009 at 7:13 Comment(2)

Maybe I should just compare those two solutions to each other considering that my patricia trie is basically done. – Hematite 28/4, 2009 at 8:1

Noting that, in practice, in many languages/engines the strings will be backed by a rope implementation, potentially reducing the memory footprint (perhaps not as much as a Patricia trie, but narrowing the gap). – Heavyladen 29/5 at 16:1

Another alternative algo is a ternary search tree (more memory efficient) https://github.com/varunpant/TernaryTree/tree/master/TernaryTree
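For comparison, a ternary search tree prefix lookup could be sketched as follows. This is not the code from the linked repository; `TstNode`, `tst_insert`, `tst_collect` and `tst_complete` are hypothetical names, and the sketch ignores balancing, so the tree shape depends on insertion order.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node: one character plus three links (smaller, next char, larger).
struct TstNode {
    char ch;
    bool is_word = false;
    std::unique_ptr<TstNode> lo, eq, hi;
    explicit TstNode(char c) : ch(c) {}
};

// Inserts a word; empty words are ignored in this sketch.
void tst_insert(std::unique_ptr<TstNode>& root, const std::string& word) {
    if (word.empty()) return;
    std::unique_ptr<TstNode>* slot = &root;
    std::size_t i = 0;
    while (true) {
        if (!*slot) slot->reset(new TstNode(word[i]));
        TstNode* node = slot->get();
        if (word[i] < node->ch) slot = &node->lo;
        else if (word[i] > node->ch) slot = &node->hi;
        else if (i + 1 == word.size()) { node->is_word = true; return; }
        else { slot = &node->eq; ++i; }
    }
}

// In-order walk of a subtree; `path` holds the characters matched so far.
void tst_collect(const TstNode* node, std::string& path,
                 std::vector<std::string>& out) {
    if (!node) return;
    tst_collect(node->lo.get(), path, out);
    path.push_back(node->ch);
    if (node->is_word) out.push_back(path);
    tst_collect(node->eq.get(), path, out);
    path.pop_back();
    tst_collect(node->hi.get(), path, out);
}

std::vector<std::string> tst_complete(const std::unique_ptr<TstNode>& root,
                                      const std::string& prefix) {
    std::vector<std::string> out;
    std::string path;
    if (prefix.empty()) {  // every stored word matches the empty prefix
        tst_collect(root.get(), path, out);
        return out;
    }
    // Find the node that matches the last character of the prefix.
    const TstNode* node = root.get();
    std::size_t i = 0;
    while (node) {
        if (prefix[i] < node->ch) node = node->lo.get();
        else if (prefix[i] > node->ch) node = node->hi.get();
        else if (i + 1 == prefix.size()) break;
        else { node = node->eq.get(); ++i; }
    }
    if (!node) return out;
    path = prefix;
    if (node->is_word) out.push_back(path);
    tst_collect(node->eq.get(), path, out);  // only the "next char" subtree matches
    return out;
}
```

Each node stores a single character and three pointers rather than a per-node child table, which is where the memory argument for this structure comes from.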

Migration answered 18/6, 2013 at 20:54 Comment(0)
