How to create a trie in Python
A

16

164

I'm interested in tries and DAWGs (directed acyclic word graphs). I've been reading a lot about them, but I don't understand what the output trie or DAWG file should look like.

  • Should a trie be an object of nested dictionaries, where each letter is divided into further letters and so on?
  • Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
  • How to implement word-blocks consisting of more than one word separated with - or space?
  • How to link prefix or suffix of a word to another part in the structure? (for DAWG)

I want to understand the best output structure in order to figure out how to create and use one.

I would also appreciate knowing what the output of a DAWG should be, along with that of a trie.

I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.

Airlift answered 13/6, 2012 at 12:56 Comment(1)
Read kmike.ru/python-data-structures for a survey of exotic data structures in PythonGaselier
D
221

Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:

>>> _end = '_end_'
>>> 
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
... 
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}}, 
             'z': {'_end_': '_end_'}}}, 
 'f': {'o': {'o': {'_end_': '_end_'}}}}

If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.)
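To make the difference from get concrete, here is a small illustration. The dictionary and keys are made up for the example:

```python
d = {}

# get() returns the fallback but leaves the dictionary untouched.
assert d.get('a', {}) == {}
assert 'a' not in d

# setdefault() stores the fallback under the key and returns that same object.
child = d.setdefault('a', {})
assert d == {'a': {}}
assert child is d['a']

# A second call finds the key and returns the existing value,
# ignoring the new default.
child['b'] = {}
again = d.setdefault('a', {'ignored': True})
assert again is child
print(d)  # {'a': {'b': {}}}
```

That "return the existing child or create it" behaviour is what lets make_trie walk down and extend the structure in a single line.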

Next, a function to test whether the word is in the trie:

>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
... 
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False

I'll leave insertion and removal to you as an exercise.
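For readers who want to check their attempt, here is one possible sketch of both operations on the nested-dict layout. The function names insert and remove are my own, and the pruning strategy in remove is just one reasonable choice, not the only one:

```python
_end = '_end_'  # same end-of-word marker as in make_trie


def insert(trie, word):
    """Add word to a nested-dict trie in place."""
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node[_end] = _end


def remove(trie, word):
    """Remove word if present; return True if something was removed."""
    path = []  # (parent_dict, letter) pairs along the word
    node = trie
    for letter in word:
        if letter not in node:
            return False
        path.append((node, letter))
        node = node[letter]
    if _end not in node:
        return False  # word is only a prefix of other words
    del node[_end]
    # Walk back up, deleting branches that no longer lead to any word.
    for parent, letter in reversed(path):
        if parent[letter]:
            break  # still holds an end marker or other children
        del parent[letter]
    return True


trie = {}
for w in ('foo', 'bar', 'baz', 'barz'):
    insert(trie, w)
remove(trie, 'bar')  # 'barz' must survive
remove(trie, 'foo')  # the whole 'f' branch disappears
print(trie)
# {'b': {'a': {'r': {'z': {'_end_': '_end_'}}, 'z': {'_end_': '_end_'}}}}
```

Pruning empty branches on removal keeps the structure from accumulating dead nodes; if you never remove words, you can skip that loop entirely.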

Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.

Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.
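To give an idea of what the resulting object can look like, here is a rough sketch that builds the nested-dict trie first and then merges identical subtrees bottom-up. The names END, build_trie, minimize and count_nodes are made up for this example, and this offline merging is a simplification, not the incremental construction that dedicated DAWG libraries use. The result is still nested dicts, but words with a common suffix end up pointing at the same dict object:

```python
END = '_end_'  # assumed end-of-word marker, as in the trie above


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = END
    return root


def minimize(node, registry):
    """Return the canonical shared copy of node, merging equal subtrees."""
    for key in list(node):
        if key != END:
            node[key] = minimize(node[key], registry)
    # Two nodes are equivalent if they agree on "is a word end" and on
    # every letter -> canonical child edge. Children are already canonical
    # at this point, so comparing their identities is enough.
    signature = (END in node,
                 tuple(sorted((k, id(v)) for k, v in node.items() if k != END)))
    return registry.setdefault(signature, node)


def count_nodes(node, seen=None):
    """Count distinct dict objects reachable from node."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(v, seen) for k, v in node.items() if k != END)


words = ['tap', 'taps', 'top', 'tops']
trie = build_trie(words)
print(count_nodes(trie))                 # 8 nodes in the plain trie
dawg = minimize(build_trie(words), {})
print(count_nodes(dawg))                 # 5 nodes once suffixes are shared
print(dawg['t']['a'] is dawg['t']['o'])  # True: both edges lead to one dict
```

Lookups work exactly as with the plain trie (in_trie needs no changes), since sharing only affects which objects the edges point to.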

Davide answered 13/6, 2012 at 13:56 Comment(6)
There, change made. I'd stick with dict.setdefault() (it's underutilised and not nearly well-known enough), in part because it helps prevent bugs that are too easy to create with a defaultdict (where you'd not get a KeyError for non-existing keys on indexing). The only thing now that would make it useable for production code is using _end = object() :-)Masquerade
@MartijnPieters hmmm, I specifically chose not to use object, but I can't remember why. Perhaps because it would be hard to interpret when seen in the demo? I guess I could make an end object with a custom reprDavide
Sorry, but this ain't a true Trie implementation as it doesn't offer prefix searchIncense
@PrithivirajDamodaran, I would say that "trie" is the name of a data structure. "Prefix search" is the name of an algorithm that you can use with a trie. I haven't implemented prefix search here, but that doesn't keep the data structure from being a trie.Davide
@PrithivirajDamodaran See my answer to have prefix search here: https://mcmap.net/q/149402/-how-to-create-a-trie-in-pythonLeontine
I think root = dict() should be root = {}Abreu
L
40

Here is a list of python packages that implement Trie:

  • marisa-trie - a C++ based implementation.
  • python-trie - a simple pure python implementation.
  • PyTrie - a more advanced pure python implementation.
  • pygtrie - a pure python implementation by Google.
  • datrie - a double array trie implementation based on libdatrie.
Lockman answered 23/1, 2014 at 8:36 Comment(1)
marisa-trie: v0.7.7, 2021-08-04; python-trie: v?, 2013-09-21; PyTrie: v0.4.0, 2020-10-21; pygtrie: v2.4.2, 2021-01-03; datrie: v0.8.2, 2020-03-26Massengale
E
31

Have a look at this:

https://github.com/kmike/marisa-trie

Static memory-efficient Trie structures for Python (2.x and 3.x).

String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.

Based on marisa-trie C++ library.

Here's a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/

At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.

...

I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.

What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.

There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance:

Etch answered 16/10, 2012 at 11:22 Comment(2)
last commit was in April 2018, last major commit was in like 2017Titos
marisa-trie seems to have had a spate of activity recently, maybe a new maintainerEtch
S
26

Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.

from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class

test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')

print(test.search('hello'))
print(test.startsWith('hello'))
print(test.search('ilikeapple'))
Squamulose answered 8/5, 2015 at 13:1 Comment(3)
My understanding of space complexity is O(n*m). Some have discussion here. #2719316Squamulose
@Squamulose u are using defaultdict only for the first char only. Rest chars still use normal dict. Would be better to use nested defaultdict.Meatiness
Actually, the code doesn't seem to be "using" the defaultdict for the first character either since it doesn't set the default_factory and is still using set_default.Climacteric
H
16

Using defaultdict and reduce function.

Create Trie

from functools import reduce
from collections import defaultdict
T = lambda : defaultdict(T)
trie = T()
reduce(dict.__getitem__,'how',trie)['isEnd'] = True

Trie :

defaultdict(<function __main__.<lambda>()>,
            {'h': defaultdict(<function __main__.<lambda>()>,
                         {'o': defaultdict(<function __main__.<lambda>()>,
                                      {'w': defaultdict(<function __main__.<lambda>()>,
                                                   {'isEnd': True})})})})

Search In Trie :

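curr = trie
for w in 'how':
    if w in curr:
        curr = curr[w]
    else:
        print("Not Found")
        break
else:
    # Runs only if the loop did not break, i.e. every letter matched.
    # .get() avoids inserting a new 'isEnd' key into the defaultdict.
    print('Found' if curr.get('isEnd') else 'Not Found')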
Hinckley answered 11/10, 2020 at 11:44 Comment(1)
GREAT implementation!! Thank you so much.Archil
K
15

There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.

I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
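As a rough sketch of that layout, as I read the description: all nodes live in flat lists, and each node refers to its children by index. The class name FlatTrie and its attribute names are made up for illustration, and the per-node child lookup is a plain linear search:

```python
class FlatTrie:
    """Trie stored as parallel lists; node 0 is the root."""

    def __init__(self):
        self.labels = [[]]    # labels[n]: letters on the edges leaving node n
        self.targets = [[]]   # targets[n]: indices of the matching child nodes
        self.is_end = [False]

    def _child(self, node, letter):
        # Linear search, bounded by the alphabet size.
        try:
            return self.targets[node][self.labels[node].index(letter)]
        except ValueError:
            return None

    def insert(self, word):
        node = 0
        for letter in word:
            child = self._child(node, letter)
            if child is None:
                child = len(self.is_end)  # index of the new node
                self.labels.append([])
                self.targets.append([])
                self.is_end.append(False)
                self.labels[node].append(letter)
                self.targets[node].append(child)
            node = child
        self.is_end[node] = True

    def __contains__(self, word):
        node = 0
        for letter in word:
            node = self._child(node, letter)
            if node is None:
                return False
        return self.is_end[node]


t = FlatTrie()
for w in ('foo', 'bar', 'baz'):
    t.insert(w)
print(t.targets)              # [[1, 4], [2], [3], [], [5], [6, 7], [], []]
print('baz' in t, 'ba' in t)  # True False
```

Because the whole structure is a handful of lists of integers, strings and booleans, it is straightforward to dump to a file, which is one answer to what the "output" could look like.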

Kristof answered 13/6, 2012 at 12:59 Comment(2)
once again, thank you however I still think that your answer needs a bit more deeper explanation and clarification since my question is aimed at figuring out the logic and structure of the functionality of DAWGs and TRIEs. Your further input will be very useful and appreciated.Airlift
Unless you use objects with slots, your instance namespace will be dictionaries anyway.Woodall
S
11
from collections import defaultdict

Define Trie:

_trie = lambda: defaultdict(_trie)

Create Trie:

trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")

Lookup:

def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr

Test:

print(word_exist(trie, 'cam'))
Shang answered 19/3, 2018 at 7:32 Comment(3)
caution: this returns True only for an entire word, but not for prefix, for prefix change return '_end' in curr to return TrueSheepwalk
@ShrikantShete You must also add if '_end' in curr: return True before if w not in curr: return False if you want prefix search.Leontine
It doesn't affect the impl, but just an unrelated note that curr.setdefault("_end") adds key-value pair of "_end", None to the dictionary curr even when curr is a defaultdict. i.e. dict.setdefault(key [,default]) is not aware of the defaultdict's default_factory (in this case, _trie). At least, that's what I saw when I tried it.Bunns
T
10

Here is full code using a TrieNode class. Also implemented auto_complete method to return the matching words with a prefix.

Since we are using a dictionary to store children, there is no need to convert characters to integers and back, and no need to allocate array memory in advance.

class TrieNode:
    def __init__(self):
        #Dict: Key = letter, Item = TrieNode
        self.children = {}
        self.end = False
class Trie:
    def __init__(self):
        self.root = TrieNode()

    def build_trie(self,words):       
        for word in words:
            self.insert(word)

    def insert(self,word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.end = True
    def search(self, word):
        node = self.root
        for char in word:
            if char in node.children:
                node = node.children[char]
            else:
                return False
            
        return node.end

    def _walk_trie(self, node, word, word_list):
        # Depth-first walk collecting every complete word below node.
        for char in node.children:
            word_new = word + char
            if node.children[char].end:
                word_list.append(word_new)
            self._walk_trie(node.children[char], word_new, word_list)

    def auto_complete(self, partial_word):
        node = self.root
        word_list = []
        # Find the node for the last char of partial_word.
        for char in partial_word:
            if char in node.children:
                node = node.children[char]
            else:
                # partial_word not found: return the empty list.
                return word_list

        if node.end:
            word_list.append(partial_word)

        # Collect suggestions that start with partial_word.
        self._walk_trie(node, partial_word, word_list)
        return word_list

create a Trie

t = Trie()
words = ['hi', 'hieght', 'rat', 'ram', 'rattle', 'hill']
t.build_trie(words)

Search for word

words = ['hi', 'hello']
for word in  words:
    print(word, t.search(word))

hi True
hello False

search for words using prefix

partial_word = 'ra'
t.auto_complete(partial_word)

['rat', 'rattle', 'ram']
Tormoria answered 22/3, 2021 at 7:0 Comment(0)
S
4

If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:

class Trie:

    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        if not len(self):
            self.delete([])
        return self

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])
Sara answered 12/7, 2013 at 16:38 Comment(2)
Thank you @NoctisSkytower. This is great to begin with but I kind of gave up on Python and TRIES or DAWGs due to extremely high memory consumption of Python in these case scenarios.Airlift
That's what __slots__ is for. It reduces the amount of memory used by a class, when you have many instances of it.Imes
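A minimal sketch of what a slotted node could look like. The class name SlimNode and the helper insert are made up for illustration; the actual memory savings depend on the Python version and are not measured here:

```python
class SlimNode:
    # __slots__ replaces the per-instance __dict__ with fixed attribute storage.
    __slots__ = ('children', 'final')

    def __init__(self):
        self.children = {}   # letter -> SlimNode
        self.final = False


def insert(root, word):
    node = root
    for letter in word:
        if letter not in node.children:
            node.children[letter] = SlimNode()
        node = node.children[letter]
    node.final = True


root = SlimNode()
insert(root, 'hi')
print(hasattr(root, '__dict__'))                # False
print(root.children['h'].children['i'].final)   # True
```

Note that each node still owns a children dict, so this only removes the per-instance attribute dictionary, not the dictionaries used for the edges.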
S
3

This version is using recursion

import pprint
from collections import deque

inp = input("Enter a sentence to show as trie\n")
words = inp.split()  # split on any whitespace; skips empty strings
trie = {}


def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
    except IndexError:
        # End of the word
        return {}

    out = trie_recursion(trie_ds.get(letter, {}), word)

    # Don't update if letter already present
    if letter not in trie_ds:
        trie_ds[letter] = out

    return trie_ds

for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))

pprint.pprint(trie)

Output:

Coool👾 <algos>🚸  python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
  'b': {
    'a': {
      'r': {},
      'z': {}
    }
  },
  'f': {
    'o': {
      'o': {}
    },
    'u': {
      'n': {}
    }
  }
}
Summarize answered 13/6, 2016 at 12:24 Comment(0)
G
2

This is much like a previous answer but simpler to read:

def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head["_end_"] = "_end_"
    return trie
Glazer answered 18/8, 2020 at 9:7 Comment(1)
The assignment head[char] = {} is just manually implementing collections.defaultdict or dict.setdefault(letter, {})Yuriyuria
V
1
class TrieNode:
    def __init__(self):
        self.keys = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()
    def insert(self, word: str, node=None) -> None:
        if node == None:
            node = self.root
        # insertion is a recursive operation
        # this is base case to exit the recursion
        if len(word) == 0:
            node.end = True
            return
        # if this key does not exist create a new node
        elif word[0] not in node.keys:
            node.keys[word[0]] = TrieNode()
            self.insert(word[1:], node.keys[word[0]])
        # that means key exists
        else:
            self.insert(word[1:], node.keys[word[0]])
    def search(self, word: str, node=None) -> bool:
        if node == None:
            node = self.root
        # this is positive base case to exit the recursion
        if len(word) == 0 and node.end == True:
            return True
        elif len(word) == 0:
            return False
        elif word[0] not in node.keys:
            return False
        else:
            return self.search(word[1:], node.keys[word[0]])
    def startsWith(self, prefix: str, node=None) -> bool:
        if node == None:
            node = self.root
        if len(prefix) == 0:
            return True
        elif prefix[0] not in node.keys:
            return False
        else:
            return self.startsWith(prefix[1:], node.keys[prefix[0]])
Varney answered 16/10, 2021 at 1:8 Comment(0)
S
0
class Trie:
    def __init__(self):
        # Per-instance root; a class-level dict would be shared by every Trie.
        self.head = {}

    def add(self,word):

        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self,word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]

        if '*' in cur:
            return True
        else:
            return False
    def printf(self):
        print (self.head)

dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")


print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()

Out

True
False
False
False
{'h': {'i': {'*': True}}}
Swanger answered 24/7, 2019 at 15:27 Comment(0)
S
0

Python Class for Trie


A trie stores a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL). A string can be searched for in O(L), and the same goes for deletion.

Can be cloned from https://github.com/Parikshit22/pytrie.git

class Node:
    def __init__(self):
        self.children = [None]*26
        self.isend = False
        
class trie:
    def __init__(self,):
        self.__root = Node()
        
    def __len__(self,):
        return len(self.search_byprefix(''))
    
    def __str__(self):
        ll =  self.search_byprefix('')
        string = ''
        for i in ll:
            string+=i
            string+='\n'
        return string
        
    def chartoint(self,character):
        return ord(character)-ord('a')
    
    def remove(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return
    
    def insert(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True
        
    def search(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True
    
    def __getall(self,ptr,key,key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend==True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i]  is not None:
                self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)
        
    def search_byprefix(self,key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        
        self.__getall(ptr,key,key_list)
        return key_list
        

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)

Code Output

True
False
['minakshi', 'minhaj']
7
None
minakshi
pari
parikshit
shubh
shubham
shubhi

Southern answered 25/7, 2020 at 6:35 Comment(0)
L
0

With prefix search

Here is @senderle's answer, slightly modified to accept prefix search (and not only whole-word matching):

_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if _end in current_dict:
            return True
        if letter not in current_dict:
            return False
        current_dict = current_dict[letter]
    return _end in current_dict

t = make_trie(['hello', 'hi', 'foo', 'bar'])
print(in_trie(t, 'hello world')) 
# True
Leontine answered 7/9, 2021 at 11:19 Comment(0)
M
0

In response to @basj

The following code will capture \b (end of word) letters.

_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter not in current_dict:              # Adjusted the
            return False                            # order of letter
        if _end in current_dict[letter]:            # checks to capture
            return True                             # the last letter.
        current_dict = current_dict[letter]
    return False                                    # no stored word matched

t = make_trie(['hello', 'hi', 'foo', 'bar'])

>>> print(in_trie(t, 'hi'))
True
>>> print(in_trie(t, 'hola'))
False
>>> print(in_trie(t, 'hello friend'))
True
>>> print(in_trie(t, 'hel'))
False
Mirage answered 3/1, 2023 at 21:16 Comment(0)
