What's the size of a prefix tree (trie) that contains all the English words?

Knowing that there are about 200k words in the English dictionary and that the alphabet has 26 letters.

Exobiology answered 4/3, 2014 at 21:4 Comment(1)
Your estimate is low. According to the Oxford Dictionaries site, there are at least a quarter million English words, and that doesn't count technical words that the dictionary doesn't track. – Tay

In this article, the author built a trie of English words from a file that was 935,015 bytes long. It required a quarter million nodes. He claims a compression ratio of approximately 73%, which is pretty close to what I remember from my work with such data structures.
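
If you want to sanity-check those numbers against a word list of your own, here's a minimal sketch (my illustration, not the article's code; it assumes a hypothetical words.txt with one word per line):

```python
# Build a dict-based trie from a word list and count its nodes.
def count_trie_nodes(words):
    root = {}
    nodes = 1  # count the root itself
    for word in words:
        cur = root
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
                nodes += 1
            cur = cur[ch]
        cur["$"] = {}  # end-of-word marker, not counted as a node
    return nodes

with open("words.txt") as f:
    words = [line.strip().lower() for line in f if line.strip()]

print(count_trie_nodes(words), "nodes for", sum(len(w) for w in words), "characters")
```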

Note that his implementation wastes a lot of memory by storing an array of 26 child pointers for every node. A much less expensive implementation would maintain just the pointers it needs, ordered by frequency of use. For example, it's kind of crazy to store 26 child node pointers for the letter q in a word, considering that it's highly unlikely that the character after q will be anything but u.

Sequentially searching that would take slightly longer than directly indexing an array, but it would make for considerable savings in memory. And the memory savings could result in a lot fewer cache misses, which could very well more than make up for the increased cost of linear search.
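
Here's a rough sketch of that idea (my own illustration, not the linked article's code): children live in a short list instead of a 26-slot array, and a move-to-front heuristic approximates ordering by frequency of use:

```python
class Node:
    __slots__ = ("children", "terminal")  # keep per-node overhead small

    def __init__(self):
        self.children = []     # list of (char, Node) pairs, not a 26-slot array
        self.terminal = False  # True if a word ends at this node

    def child(self, ch):
        # Linear scan; move-to-front keeps frequently used links cheap to find.
        for i, (c, node) in enumerate(self.children):
            if c == ch:
                if i:
                    self.children.insert(0, self.children.pop(i))
                return node
        return None

    def add(self, word):
        node = self
        for ch in word:
            nxt = node.child(ch)
            if nxt is None:
                nxt = Node()
                node.children.append((ch, nxt))
            node = nxt
        node.terminal = True
```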

If you're interested in saving even more space, you can build a Directed Acyclic Word Graph (DAWG), which takes advantage of common endings as well as some other optimizations. For example, it can compress dangling endings into a single node.
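
A full DAWG builder is more involved, but you can get much of the savings by merging structurally identical subtrees after the trie is built. A sketch, building on the Node class above (real DAWG construction, e.g. Daciuk et al., does this incrementally rather than as a post-pass):

```python
def merge_suffixes(node, registry=None):
    # Bottom-up hash-consing: canonicalize the children first, then replace
    # this node with a structurally identical one if we've seen it before.
    if registry is None:
        registry = {}
    for i, (ch, child) in enumerate(node.children):
        node.children[i] = (ch, merge_suffixes(child, registry))
    # A node's identity is its terminal flag plus its outgoing edges; sorting
    # makes the key independent of the move-to-front ordering above.
    key = (node.terminal,
           tuple(sorted((ch, id(child)) for ch, child in node.children)))
    return registry.setdefault(key, node)
```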

Tay answered 5/3, 2014 at 3:45 Comment(0)

Using a simple prefix tree, the space requirement should be O(N*C), where C is the average number of characters per word and N is the number of words. This is because, in the worst case, a trie stores a node for every character of every word. So a fair estimate would be around 1 million characters stored, or around 1 MB.
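
A quick back-of-envelope version of that arithmetic (using the ~200k word count and ~5-character average quoted elsewhere in this thread; these are assumptions, not measurements):

```python
# Worst-case size of a simple trie: no shared prefixes at all, so one node
# per character. The inputs are the figures quoted in this thread.
n_words = 200_000
avg_len = 5.1

worst_case_nodes = int(n_words * avg_len)
print(f"{worst_case_nodes:,} nodes")  # ~1,020,000

# At one byte per node this is about 1 MB, but real nodes also carry child
# pointers, so actual memory use is a multiple of that.
```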

Segmentation answered 4/3, 2014 at 23:45 Comment(7)
Do you have any references for this? A trie of 600,000 English words will store considerably fewer than 600,000 nodes. Certainly no trie I'm aware of stores "c", "ca", and "cat" for the word "cat". I think you need to read up on what a trie is and how it's stored: en.wikipedia.org/wiki/Trie – Tay
True enough; I was thinking of a more complex data structure used for substring searches, which is based on a trie. Although that might be O(N*C) as well, now that I think about it. In that case, it's only O(N*C) where C is the average number of characters, and that is a high estimate. – Segmentation
Yes. O(N*C) is the worst case, which can only occur if there are no common prefixes. – Tay
I said that a trie of 600,000 English words would store fewer than 600,000 nodes; I meant a trie of words totalling 600,000 characters. – Tay
Given that the average number of characters per word is constant, the number of characters and the number of words are both O(N). So I knew what you meant, even if I didn't catch the error; I pretty much saw 600,000 twice, skimmed your answer, and went to the link you gave to review it in more detail. – Segmentation
According to this article, there are fewer than 200k words currently in use in the English language: goo.gl/tlgxTW – Exobiology
Considering the following implementation (goo.gl/J5i2br): the number of links in a trie is between LN and LNk, where N is the number of keys, k is the average key length, and L is the number of links per node (the alphabet size). Every key in the trie has a node containing its associated value that also has L links, so the number of links is at least LN. If the first characters of all the keys are different, then there is a node with L links for every key character, so the number of links is L times the total number of key characters, or LNk. – Exobiology

Wolfram Alpha says that the average length of an English word is 5.1 characters: http://www.wolframalpha.com/input/?i=average+english+word+length

If L = 26 is the number of letters in the alphabet and K = 5.1 is the average length of an English word, then I would expect the space complexity to be somewhere around O(L^K) (L to the power K).

The implementation in an actual language may vary, I suppose.

Exobiology answered 4/3, 2014 at 21:11 Comment(2)
Your estimate seems unfounded. L^K is the number of all K-length strings over an alphabet of L symbols, i.e. the only relevant figure it could possibly estimate is the number of words there are, but that's already given; it's not the quantity we're trying to find. Moreover, it's wrong even for that purpose, both in theory (it counts all possible strings that have the same length as an average English word, but most of those aren't English words, and many English words have a different length) and in practice (it gives about 8.8 billion instead of 200 thousand). – Basso
In addition to @delnan's objections, the L^K number assumes no common prefixes. The entire point of a prefix tree is to take advantage of common prefixes. A quick search reveals empirical results ranging from (N*K)/4 to (N*K)/3 nodes, where N is the number of words and K is the average length of the words. – Tay
