Memory Efficient data structure for Trie Implementation

Asked 19/4, 2015 at 18:46 Answered 30/8, 2016 at 18:11

I am implementing a Trie in python. Till now I have come across two different methods to implement it:

1) use a class Node (similar to struct Node in C++) with data members:

char - to store character
is_end - to store end of word (true or false)
prefix_count - store number of words with current prefix

child - Node type dict (to store other nodes i.e. for 26 alphabets)

class Node(object):
    def __init__(self):
        self.char = ''
        self.word = ''
        self.is_end = False
        self.prefix_count = 0
        self.child = {}

2) use a dictionary to store all the data:

e.g. for the input words = {'foo', 'bar', 'baz', 'barz'}

we derive this dictionary:

         {'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
          'z': {'_end_': '_end_'}}},
          'f': {'o': {'o': {'_end_': '_end_'}}}}

Which is the efficient and standard data structure, which is both memory efficient and fast for traversal and other trie operations on big dataset of words?

Church answered 19/4, 2015 at 18:46 Comment(11)

github.com/kmike/marisa-trie – Demetrademetre 19/4, 2015 at 18:59

How do you plan to reference objects of Node in self.child, considering this is a dictionary? If indeed you keep it as a dict, and somehow refer Node objects, I see both the methods as having same time complexity, but 1st one having more space complexity. And if you refer self.child as a list, then the 1st one might be a bit slower – Ibnrushd 19/4, 2015 at 19:2

Thanks for the reply. each child in Node will have another object of type Node, which will make it a n-ary tree. @Ibnrushd – Church 19/4, 2015 at 19:4

As per me, first one is more structured and has rich information compared to the second one. Also first one will be helpful to add further functionalities like del, count, etc... but still want some opinions for the practical use. @Ibnrushd – Church 19/4, 2015 at 19:9

#11015820 – Demetrademetre 19/4, 2015 at 19:9

@PadraicCunningham already referred it. My question is based on the same topic but a little different. Also I am not going with the 'key' logic used in marisa-trie. – Church 19/4, 2015 at 19:14

@divyum, Without specific details on what exactly you want to do you can only expect opinions based on peoples personal preference. The answers in that question discuss pros and cons and provide examples. – Demetrademetre 19/4, 2015 at 19:17

@PadraicCunningham, so requirement is to implement a simple trie but in such a way that it can be used for a big dataset of words (already mentioned in question). In that question, its mentioned "for a large, scalable trie, nested dictionaries might become cumbersome", so I am just asking is the first method better for large, scalable trie. P.S. I am not implementing the key logic used in marisa-trie. Hope the requirements are more clear now. – Church 19/4, 2015 at 19:25

@Church yes the first one allows you a more structured form which solves the cumbersome part, but at the same time occupies more space. If you are talking about huge datasets, the thing you worry about is the time complexity. You need a traversal to be O(len(search_str)) regardless of the trie. Both the above methods would yield the exact same thing. – Ibnrushd 19/4, 2015 at 20:19

If you use dict to implement trie, you're [almost] cheating. – Nanceynanchang 7/5, 2015 at 14:0

Note: trie is not an efficient data-structure. Use tree. See: cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixtrees.pdf – Gorgoneion 9/4, 2016 at 10:17

Why not both? Just yesterday I was implementing a similar data structure for storing and retrieving a hierarchy of objects and contemplated this exact situation. Ended up using a Node object with a children dictionary. The main node being an object allows you to have custom methods for printing it or getting stuff, and you can even have a lazy initialization if needed (you mentioned big dataset right?)

class Node(object):
    def __init__(self):
        self.children = collections.defaultdict(lambda:self.__class__())
        self.stuff = ...
    def __str__(self,depth=0):
        if self.children:
            return str(self.stuff)+'\n'+'\n'.join([' '*depth+k+' ' + v.__str__(depth+1) for k,v in self.children.items()])
        else:
            return str(self.stuff)
    def get_path(self,path):
        node = self
        while path:
            node = node.children[path.pop(0)]
        ...

Cordi answered 30/4, 2015 at 21:7 Comment(1)

yes, I mentioned big dataset. It can obviously depends on the requirements but my reason was to make it structured along with some other functionalities like counting prefixes, suggesting words, etc... With the first implementation we can add more functionalities but in the second one we shall be bound to specific ones. I wanted to make it more scalable and real time. – Church 30/4, 2015 at 21:34

A direct replacement would be nested a list;

However [arguably] more Pythonic, more compact in memory and thus faster for lookup would be a nested tuple.

Of course, updating such trie becomes logN, as you'll have to recreate every ancestor node. Still, if your lookups are a lot more frequent than updates, it may be worth it.

Nanceynanchang answered 7/5, 2015 at 14:4 Comment(0)

-3

Trie is failure when it comes to space complexity. Trie tends to use lots of memory for processing and operating. But to avoid this problem there is a datastructure know as succinct data structure. Try implementing that here.

Refer here for more information.

Tearle answered 30/8, 2016 at 18:11 Comment(0)

1) use a class Node (similar to struct Node in C++) with data members:

2) use a dictionary to store all the data:

Recommended topics

Hot tags