Build trie faster
Asked Answered
M

10

23

I'm making an mobile app which needs thousands of fast string lookups and prefix checks. To speed this up, I made a Trie out of my word list, which has about 180,000 words.

Everything's great, but the only problem is that building this huge trie (it has about 400,000 nodes) takes about 10 seconds currently on my phone, which is really slow.

Here's the code that builds the trie.

public SimpleTrie makeTrie(String file) throws Exception {
    String line;
    SimpleTrie trie = new SimpleTrie();

    BufferedReader br = new BufferedReader(new FileReader(file));
    while( (line = br.readLine()) != null) {
        trie.insert(line);
    }
    br.close();

    return trie;
}

The insert method which runs on O(length of key)

public void insert(String key) {
    TrieNode crawler = root;
    for(int level=0 ; level < key.length() ; level++) {
        int index = key.charAt(level) - 'A';
        if(crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
    }
    crawler.valid = true;
}

I'm looking for intuitive methods to build the trie faster. Maybe I build the trie just once on my laptop, store it somehow to the disk, and load it from a file in the phone? But I don't know how to implement this.

Or are there any other prefix data structures which will take less time to build, but have similar lookup time complexity?

Any suggestions are appreciated. Thanks in advance.

EDIT

Someone suggested using Java Serialization. I tried it, but it was very slow with this code:

public void serializeTrie(SimpleTrie trie, String file) {
        try {
            ObjectOutput out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
            out.writeObject(trie);
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public SimpleTrie deserializeTrie(String file) {
        try {
            ObjectInput in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)));
            SimpleTrie trie = (SimpleTrie)in.readObject();
            in.close();
            return trie;
        } catch (IOException | ClassNotFoundException e) {
            e.printStackTrace();
            return null;
        }
    }

Can this above code be made faster?

My trie: http://pastebin.com/QkFisi09

Word list: http://www.isc.ro/lists/twl06.zip

Android IDE used to run code: http://play.google.com/store/apps/details?id=com.jimmychen.app.sand

Manicotti answered 23/9, 2013 at 16:0 Comment(6)
I cannot install the ide on an android gingerbread?Faro
I would suggest to start from profiling. At least measuring which part is spent for (1) reading from file, (2) finding location in trie and (3) creating a new nodeDeakin
@Manicotti Did you ever try the binary search technique? I saw good results with it.Denims
@Denims Yes I did try it but it didn't seem too fast. I just need two queries: whether a prefix exists, and whether a word exists. I don't need all strings that start from a prefix. Btw, I counted the number of prefix existence searches, it was about 10,000.. so the binary search method was slower, because with the dawg, the whole algorithm finished in ~60 ms.Manicotti
@Manicotti OK, good that you found a solution. I never found a prefix query which was slower than a 1 millisecond and same for existence of a single string but maybe I have a faster phone.Denims
Performance comparison DAFSA memory consumed: 16020976 DAFSA (ms) : [100] 0 DAFSA (ms) : [10000] 5 DAFSA (ms) : [1000000] 28 --------------- trie memory consumed: 12946984 trie (ms) : [100] 0 trie (ms) : [10000] 6 trie (ms) : [1000000] 131 --------------- List memory consumed: 1761728 List (ms) : [100] 23 List (ms) : [10000] 696 List (ms) : [1000000] 71752 --------------- Set memory consumed: 2341616 Set (ms) : [100] 0 Set (ms) : [10000] 1 Set (ms) : [1000000] 22Assurgent
C
25

Double-Array tries are very fast to save/load because all data is stored in linear arrays. They are also very fast to lookup, but the insertions can be costly. I bet there is a Java implementation somewhere.

Also, if your data is static (i.e. you don't update it on phone) consider DAFSA for your task. It is one of the most efficient data structures for storing words (must be better than "standard" tries and radix tries both for size and for speed, better than succinct tries for speed, often better than succinct tries for size). There is a good C++ implementation: dawgdic - you can use it to build DAFSA from command line and then use a Java reader for the resulting data structure (example implementation is here).

Cremona answered 27/9, 2013 at 21:8 Comment(11)
Hi. After a lot of head banging, I've been successfully able to create the DAWG and read it from Java. It's small (537K) and blazing fast. However, there is one problem that prevents me from closing this question permanently - the Github code can only check whether a string is a prefix of any word in the dictionary, it can't check whether the string is a complete word. I've wasted my whole day trying to figure out this. My app can't work without that. Can you please take a look at it?Manicotti
@Bruce: You could append some unused symbol (like '$') at the end of each dictionary word. Then you just search 'word' for a prefix and 'word$' for a complete word.Garey
@EvgenyKluev Yeah, I could do that - however I really think this functionality is there in that code - I just can't find it. Waiting for Mikhail's reply. Btw, here's some code to test the DAWG: dl.dropboxusercontent.com/u/19729481/DawgTest.7z Not working as expected.Manicotti
Hi @Bruce, there is a check missing in 'contains' method - it should return True only if there is a value associated with an index (return hasValue(index) instead of return true should work). I haven't tested/used linked Java implementation myself; it may work well for the software it was written for, but not as a general Java implementation. Sorry about wasting your time. This Python implementation is heavily tested and I'm pretty sure it works correctly: github.com/kmike/DAWG-Python/blob/… - consult with it if in doubt.Cremona
ah, and there is the "canonical" C++ source, of course: code.google.com/p/dawgdic/source/browse/trunk/src/dawgdic/…Cremona
I'm trying to read the dawg from dawg_python to verify indices which are returned against the Java one. Why is this code returning an empty array? print dawg_python.DAWG().load('dawg.bin').prefixes(u'MAX')Manicotti
I tried return hasValue(index), didn't work. I verified with C++ and Python code. Can you please look at the Java code once when you have time? You're probably the most familiar person with the implementation! If not so, where can I learn the internal file structure of the dawg, so I can debug it myself?Manicotti
For your convenience.. Here's a link to an Eclipse project for testing the Java implementation of the dawg, so you don't have to make it yourself - dl.dropboxusercontent.com/u/19729481/DawgTest.7zManicotti
I think that there are 2 possible issues. 1) it seems that you've used make-dawg command line utility which requires LF line endings, and TWL06.txt file uses CRLF line endings. 'HELLO\r'in dawg_python.DAWG().load('dawg.bin') returns True. 2) Second issue is that you haven't used '-g' option, so DAWG was created without guide, and fast key completion (i.e. find all keys that start with a given prefix) doesn't work. Please note that d.prefixes(u'MAX') finds all words that are prefixes of u'MAX' (empty because of '\r' issue), not all words that starts with u'MAX'.Cremona
@MikhailKorobov YES. That was it. The CRLF line endings and the hasValue(index) together wreaked all the havoc. I cannot express in words how thankful I am for all your help. Thank you for being kind enough to help me through this. You are an awesome person. I'll now send a pull request to that Java repo, and whole-heartedly accept this wonderful answer. Enjoy your 5k! :)Manicotti
@Manicotti Do you have this hosted somewhere on github or elsewhere I can take a look ? I need to do something similar in c# and would be a great help if I have some sample code to browse. ThanksExaggerated
V
3

You could store your trie as an array of nodes, with references to child nodes replaced with array indices. Your root node would be the first element. That way, you could easily store/load your trie from simple binary or text format.

public class SimpleTrie {
    public class TrieNode {
        boolean valid;
        int[] children;
    }
    private TrieNode[] nodes;
    private int numberOfNodes;

    private TrieNode getNode() {
        TrieNode t = nodes[++numberOnNodes];
        return t;
    }
}
Vindictive answered 27/9, 2013 at 18:50 Comment(3)
I thought about this but couldn't go forward with it. How do I represent the recursive structure of the trie? How are the parent and child indices in the array related? How can it be guaranteed that it generates the exact same trie, and not an other trie which has the same byte representation?Manicotti
@Manicotti - I don't see the problem. The recursive structure of the tree is defined by those index values, which you're serializing along with everything else. The parent and child indices are related in that the child index is stored in the parent node, replacing the child reference. You serialize by iterating through the whole array, ignoring the Trie structure. An index is an index, whether it's in a file or an array. You don't have to do binary serialization (but can if you want) - if you serialize one node per text line (e.g. a CSV file) the node numbers are also line numbers.Iconoclast
Oh I'm sorry I completely misread that yesterday, too tired I guess. Now I get it, so simple. Will try and let you know.Manicotti
D
3

Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.

Prefix look-up example:

Compare method:

private static int compare(String string, String prefix) {
    if (prefix.length()>string.length()) return Integer.MIN_VALUE;

    for (int i=0; i<prefix.length(); i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s!=p) {
            if (p<s) {
                // prefix is before string
                return -1;
            }
            // prefix is after string
            return 1;
        }
    }
    return 0;
}

Finds an occurrence of the prefix in the array and returns it's location (MIN or MAX are mean not found)

private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    if (start == end) {
        String lastValue = strings[start]; // start==end
        if (compare(lastValue,prefix)==0)
            return start; // start==end
        return Integer.MAX_VALUE;
    }

    int low = start;
    int high = end + 1; // zero indexed, so add one.
    int middle = low + ((high - low) / 2);

    String middleValue = strings[middle];
    int comp = compare(middleValue,prefix);
    if (comp == Integer.MIN_VALUE) return comp;
    if (comp==0)
        return middle;
    if (comp>0)
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}

Gets a String array and prefix, prints out occurrences of prefix in array

private static boolean testPrefix(String[] strings, String prefix) {
    int i = recursiveFind(strings, prefix, 0, strings.length-1);
    if (i==Integer.MAX_VALUE || i==Integer.MIN_VALUE) {
        // not found
        return false;
    }
    // Found an occurrence, now search up and down for other occurrences
    int up = i+1;
    int down = i;
    while (down>=0) {
        String string = strings[down];
        if (compare(string,prefix)==0) {
            System.out.println(string);
        } else {
            break;
        }
        down--;
    }
    while (up<strings.length) {
        String string = strings[up];
        if (compare(string,prefix)==0) {
            System.out.println(string);
        } else {
            break;
        }
        up++;
    }
    return true;
}
Denims answered 27/9, 2013 at 18:53 Comment(5)
Dictionary words. Trust me, I need to find whether a key is a prefix of any word in the dictionary in O(length) time. Anything otherwise gives huge time penalties. By using an array, how do I find out that a key is prefix of any word?Manicotti
Using binary search, you should be able to find prefixes in O(log N) where N is the number of words in the dictionary. I'll add some code to my answer to give an example.Denims
@Manicotti Using the algorithm above, it took less than 1 millisecond on my phone to find the the existence of a 3 letter prefix in a 200000 item string array.Denims
@Manicotti It also found over 1000 Strings with a given 3 character prefix in ~5 milliseconds.Denims
@Manicotti Ironically, on my phone this approach is performing as well as a Trie performing a prefix look-up and is beating (by a good margin) a Trie on returning all strings which contain the prefix.Denims
T
1

Here's a reasonably compact format for storing a trie on disk. I'll specify it by its (efficient) deserialization algorithm. Initialize a stack whose initial contents are the root node of the trie. Read characters one by one and interpret them as follows. The meaning of a letter A-Z is "allocate a new node, make it a child of the current top of stack, and push the newly allocated node onto the stack". The letter indicates which position the child is in. The meaning of a space is "set the valid flag of the node on top of the stack to true". The meaning of a backspace (\b) is "pop the stack".

For example, the input

TREE \b\bIE \b\b\bOO \b\b\b

gives the word list

TREE
TRIE
TOO

. On your desktop, construct the trie using whichever method and then serialize by the following recursive algorithm (pseudocode).

serialize(node):
    if node is valid: put(' ')
    for letter in A-Z:
        if node has a child under letter:
            put(letter)
            serialize(child)
            put('\b')
Tarr answered 27/9, 2013 at 20:29 Comment(0)
D
1

This isn't a magic bullet, but you can probably reduce your runtime slightly by doing one big memory allocation instead of a bunch of little ones.

I saw a ~10% speedup in the test code below (C++, not Java, sorry) when I used a "node pool" instead of relying on individual allocations:

#include <string>
#include <fstream>

#define USE_NODE_POOL

#ifdef USE_NODE_POOL
struct Node;
Node *node_pool;
int node_pool_idx = 0;
#endif

struct Node {
    void insert(const std::string &s) { insert_helper(s, 0); }
    void insert_helper(const std::string &s, int idx) {
        if (idx >= s.length()) return;
        int char_idx = s[idx] - 'A';
        if (children[char_idx] == nullptr) {
#ifdef USE_NODE_POOL
            children[char_idx] = &node_pool[node_pool_idx++];
#else
            children[char_idx] = new Node();
#endif
        }
        children[char_idx]->insert_helper(s, idx + 1);
    }
    Node *children[26] = {};
};

int main() {
#ifdef USE_NODE_POOL
    node_pool = new Node[400000];
#endif
    Node n;
    std::ifstream fin("TWL06.txt");
    std::string word;
    while (fin >> word) n.insert(word);
}
Detergency answered 28/9, 2013 at 4:7 Comment(0)
U
1

Tries that prealloate space all possible children (256) have a huge amount of wasted space. You are making your cache cry. Store those pointers to children in a resizable data structure.

Some tries will optimize by having one node to represent a long string, and break that string up only when needed.

Uncivilized answered 28/9, 2013 at 4:45 Comment(0)
D
0

Is it space inefficient or time inefficient? If you are rolling a plain trie then space may be part of the problem when dealing with a mobil device. Check out patricia/radix tries, especially if you are using it as a prefix look-up tool.

Trie: http://en.wikipedia.org/wiki/Trie

Patricia/Radix trie: http://en.wikipedia.org/wiki/Radix_tree

You didn't mention a language but here are two implementations of prefix tries in Java.

Regular trie: http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/Trie.java

Patricia/Radix (space-effecient) trie: http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/PatriciaTrie.java

Denims answered 23/9, 2013 at 23:2 Comment(6)
No, as I mentioned in the question, time is the problem not space. It must be taking around 40MB, which is feasible. I've already implemented everything - I'm just looking to speed it up. Please look at the edited question.Manicotti
@Manicotti I find it surprising that it takes 10 seconds to build a trie from 180k words. For example; building a trie from 200k on my local PC (2.0 GHz processor with 1GB memory) takes 471ms and consumes 34MBs, building a compressed trie from the same data takes 541ms and consumes 22MBs. I'd try an open source version and see if you get better results.Denims
"10 seconds on my phone"Manicotti
@Manicotti I understand but the performance of your trie is so much larger it's surprising. I'll run the same code on my HTC and check back in.Denims
Thanks! Btw, here's my trie: pastebin.com/QkFisi09 Word list: isc.ro/lists/twl06.zip I'm running it on this IDE: play.google.com/store/apps/…Manicotti
@Manicotti You'll be happy to know my Trie performs about the same are yours on my phone. It took about 7 seconds to load the Trie with 200K items.Denims
F
0

Instead of a simple file you can use a database like sqlite and a nested set or celko tree to store the trie and you can also build a faster and shorter (less nodes) trie with a ternary search trie.

Faro answered 28/9, 2013 at 4:39 Comment(0)
T
0

I don't like the idea of addressing nodes by index in array, but only because it requires one more addition (index to the pointer). But with array of preallocated nodes you will maybe save some time on allocation and initialization. And you can also save a lot of space by reserving first 26 indices for leaf nodes. Thus you'll not need to allocate and initialize 180000 leaf nodes.

Also with indices you will be able to read the prepared nodes array from disk in binary format. This has to be several times faster. But I'm not sure how to do this on your language. Is this Java?

If you checked that your source vocabulary is sorted, you may also save some time by comparing some prefix of the current string with the previous one. E.g. first 4 characters. If they are equal you can start your

for(int level=0 ; level < key.length() ; level++) {

loop from the 5-th level.

Thiouracil answered 29/9, 2013 at 11:40 Comment(0)
P
0

Generally speaking, avoid using a lot of object creations from scratch in Java, which is both slow and it also has a massive overhead. Better implement your own pooling class for memory management that allocates e.g. half a million entries at a time in one go.

Also, serialization is too slow for large lexicons. Use a binary read to populate array-based representations proposed above quickly.

Poll answered 19/4, 2020 at 11:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.