Knowing that there are about 200k words in the English dictionary and that the alphabet has 26 letters or so, how much memory would a trie of all English words require?
In this article, the author built a trie of English words from a file that was 935,015 bytes long. It required a quarter million nodes. He claims a compression ratio of approximately 73%, which is pretty close to what I remember from my work with such data structures.
Note that his implementation wastes a lot of memory by storing an array of 26 child pointers for every node. A much less expensive implementation would maintain just the pointers it needs, ordered by frequency of use. For example, it's kind of crazy to store 26 child node pointers for the letter q in a word, considering that it's highly unlikely that the character after q will be anything but u.
Sequentially searching that would take slightly longer than directly indexing an array, but it would make for considerable savings in memory. And the memory savings could result in a lot fewer cache misses, which could very well more than make up for the increased cost of linear search.
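As a rough sketch of that idea (names and structure are my own, not from the article's code): each node keeps a small list of only the children it actually has, searched linearly, and a matched child is nudged one step toward the front so frequently used letters drift to the head of the list over time.

```python
class Node:
    """Trie node that stores only the children it actually has,
    as a small list of (letter, child) pairs searched linearly."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = []   # list of (letter, Node) pairs
        self.is_word = False

    def child(self, letter):
        for i, (ch, node) in enumerate(self.children):
            if ch == letter:
                # Move the hit one step toward the front, so letters
                # that are looked up often end up near the head.
                if i > 0:
                    self.children[i - 1], self.children[i] = (
                        self.children[i], self.children[i - 1])
                return node
        return None

    def add_child(self, letter):
        node = self.child(letter)
        if node is None:
            node = Node()
            self.children.append((letter, node))
        return node


def insert(root, word):
    node = root
    for letter in word:
        node = node.add_child(letter)
    node.is_word = True


def contains(root, word):
    node = root
    for letter in word:
        node = node.child(letter)
        if node is None:
            return False
    return node.is_word
```

Most nodes in a real dictionary trie have only a handful of children, so the list stays short and the linear scan is cheap, while the 26-pointer array is avoided entirely.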
If you're interested in saving even more space, you can create a Directed Acyclic Word Graph, which also takes advantage of common endings as well as some other optimizations. For example, you can compress dangling endings into a single node.
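A minimal illustration of the DAWG idea (my own sketch, not the article's code): build a plain trie, then merge structurally identical subtrees bottom-up by hash-consing, so a shared ending is stored only once.

```python
def build_trie(words):
    """Plain trie; a node is [is_word, {letter: child}]."""
    root = [False, {}]
    for word in words:
        node = root
        for ch in word:
            node = node[1].setdefault(ch, [False, {}])
        node[0] = True
    return root


def minimize(node, registry):
    """Merge structurally identical subtrees bottom-up (hash-consing).

    Two subtrees with the same (is_word, children) signature are
    replaced by a single shared node, turning the trie into a DAWG.
    """
    children = {ch: minimize(child, registry) for ch, child in node[1].items()}
    signature = (node[0],
                 tuple(sorted((ch, id(child)) for ch, child in children.items())))
    if signature not in registry:
        node[1] = children          # keep this node, with canonical children
        registry[signature] = node
    return registry[signature]


def count_nodes(node, seen=None):
    """Number of distinct nodes reachable from `node`."""
    if seen is None:
        seen = set()
    if id(node) not in seen:
        seen.add(id(node))
        for child in node[1].values():
            count_nodes(child, seen)
    return len(seen)


trie = build_trie(["tapping", "topping"])
before = count_nodes(trie)   # 14 nodes in the plain trie
dawg = minimize(trie, {})
after = count_nodes(dawg)    # 8 nodes once the shared ending is merged
```

Here the common suffix "pping" (and the interchangeable "a"/"o" branches above it) collapse into shared nodes, which is exactly the savings a prefix-only trie can't get.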
Using a simple prefix tree, the space requirement should be O(N*C) where C is the average number of characters per word and N is the number of words. This is because in the worst case, a Trie will store every character in each word. So a fair estimate would be around 1 million characters stored, or around 1 MB.
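To make that O(N*C) bound concrete, here's a quick sketch (the word list is made up for illustration) that counts trie nodes and compares them against the raw character count; shared prefixes are what push the node count below N*C.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.is_word = False


def build(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


words = ["car", "card", "care", "cat", "dog"]
raw_chars = sum(len(w) for w in words)     # 17 characters in total
nodes = count_nodes(build(words)) - 1      # exclude the root node
print(raw_chars, nodes)                    # the trie needs 9 nodes, not 17
```

The gap between the two numbers grows with a real dictionary, which is where the roughly 73% compression figure comes from.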
Wolfram Alpha says that the average length of an English word is 5.1 characters: http://www.wolframalpha.com/input/?i=average+english+word+length
If L = 26 (the number of letters in the alphabet) and K = 5.1 (the average length of an English word), I would expect the space complexity to be somewhere around O(L^K) (L to the power K).
The implementation in an actual language may vary, I suppose.
The L^K number assumes no common prefixes. The entire point of a prefix tree is to take advantage of common prefixes. A quick search reveals empirical results ranging from (N*K)/4 to (N*K)/3 nodes, where N is the number of words and K is the average length of the words. –
Tay