Scrabble word finder: building a trie, storing a trie, using a trie?

What I’m trying to do:

  • Build a mobile web application where the user can get help finding words to play when playing Scrabble
  • Users get word suggestions by typing in any number of letters and zero or more wildcards

How I’m trying to do this:

  • Using MySQL database with a dictionary containing over 400k words
  • Using ASP.NET with C# as server-side programming language
  • Using HTML5, CSS and JavaScript

My current plan:

  • Building a Trie with all the words from the database so I can do a fast and accurate search for words depending on user letter/wildcard input

Having a plan is no good if you can’t execute it. This is what I need help with:

  • How do I build a Trie from the database? (UPDATE: I want to generate a Trie using the words already in my database; after that's done I'm not going to use the database for word matching any more)
  • How do I store the Trie for fast and easy access? (UPDATE: So I can trash my database)
  • How do I use C# to search for words using the Trie depending on letters and wildcards?

Finally:
Any help is very much appreciated. I’m still a beginner with C# and MySQL, so please be gentle.

Thank you a lot!

Bors answered 16/9, 2011 at 10:48 Comment(1)
Note that this is a follow-up question to #7419410. - Septima

First off, let's look at the constraints on the problem. You want to store a word list for a game in a data structure that efficiently supports the "anagram" problem: given a "rack" of n letters, what are all the words of n or fewer letters in the word list that can be made from that rack? The word list will be about 400K words, and so is probably about one to ten megabytes of string data when uncompressed.

A trie is the classic data structure used to solve this problem because it combines memory efficiency with search efficiency. With a word list of about 400K words of reasonable length you should be able to keep the whole trie in memory. (As opposed to going with a b-tree sort of solution, where you keep most of the tree on disk because it is too big to fit in memory all at once.)

A trie is basically nothing more than a 26-ary tree (assuming you're using the Roman alphabet) where every node holds a letter plus one additional bit that says whether that node is the end of a word.

So let's sketch the data structure:

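class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children; // requires using System.Collections.Generic;

    // Minimal constructor, used by the examples that follow.
    public TrieNode(char letter, bool isEndOfWord, List<TrieNode> children)
    {
        Letter = letter; IsEndOfWord = isEndOfWord; Children = children;
    }
}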

This of course is just a sketch; you'd probably want to make these have proper property accessors and constructors and whatnot. Also, maybe a flat list is not the best data structure; maybe some sort of dictionary is better. My advice is to get it working first, and then measure its performance, and if it is unacceptable, then experiment with making changes to improve its performance.

You can start with an empty trie:

TrieNode root = new TrieNode('^', false, new List<TrieNode>());

That is, this is the "root" trie node that represents the beginning of a word.

How do you add the word "AA", the first word in the Scrabble dictionary? Well, first make a node for the first letter:

root.Children.Add(new TrieNode('A', false, new List<TrieNode>()));

OK, our trie is now

^
|
A

Now add a node for the second letter:

root.Children[0].Children.Add(new TrieNode('A', true, new List<TrieNode>()));

Our trie is now

^
|
A
|
A$   -- we notate the end of word flag with $

Great. Now suppose we want to add AB. We already have a node for "A", so add to it the "B$" node:

root.Children[0].Children.Add(new TrieNode('B', true, new List<TrieNode>()));

and now we have

    ^
    |
    A
   / \
  A$   B$

Keep on going like that. Of course, rather than writing "root.Children[0]..." you'll write a loop that searches the trie to see if the node you want exists, and if not, create it.
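
A minimal sketch of such a loop, assuming the TrieNode shape above with public members and a three-argument constructor. The TrieBuilder class and the Insert name are illustrative, not something from a library:

```csharp
using System.Collections.Generic;

class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children;

    public TrieNode(char letter, bool isEndOfWord, List<TrieNode> children)
    {
        Letter = letter;
        IsEndOfWord = isEndOfWord;
        Children = children;
    }
}

static class TrieBuilder
{
    // Hypothetical helper: walks the trie one letter at a time, creating
    // any node that is missing, then flags the final node as a word end.
    public static void Insert(TrieNode root, string word)
    {
        if (string.IsNullOrEmpty(word))
        {
            return; // nothing to add; avoids flagging the root as a word
        }

        TrieNode current = root;
        foreach (char letter in word)
        {
            // Linear scan of the children; Find returns null if there is no match.
            TrieNode next = current.Children.Find(c => c.Letter == letter);
            if (next == null)
            {
                next = new TrieNode(letter, false, new List<TrieNode>());
                current.Children.Add(next);
            }
            current = next;
        }
        current.IsEndOfWord = true;
    }
}
```

Calling TrieBuilder.Insert(root, "AA") and then TrieBuilder.Insert(root, "AB") on an empty root should produce the same three-node trie as the hand-built example. The linear scan over Children is the simplest thing that could work; swapping in a dictionary is the kind of change to try only after measuring.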

To store your trie on disk -- frankly, I would just store the word list as a plain text file and rebuild the trie when you need to. It shouldn't take more than 30 seconds or so, and then you can re-use the trie in memory. If you do want to store the trie in some format that is more like a trie, it shouldn't be hard to come up with a serialization format.
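
As a rough sketch of the rebuild-from-text approach: this assumes a word list with one word per line, read through any TextReader. The TrieLoader name, the uppercasing step, and the "words.txt" path in the usage note are assumptions made for illustration:

```csharp
using System.Collections.Generic;
using System.IO;

class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children = new List<TrieNode>();

    public TrieNode(char letter, bool isEndOfWord)
    {
        Letter = letter;
        IsEndOfWord = isEndOfWord;
    }
}

static class TrieLoader
{
    // Hypothetical loader: reads one word per line and builds the trie in memory.
    public static TrieNode Load(TextReader reader)
    {
        var root = new TrieNode('^', false);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Normalize so that lookups can assume upper-case letters.
            string word = line.Trim().ToUpperInvariant();
            if (word.Length > 0)
            {
                Insert(root, word);
            }
        }
        return root;
    }

    // Same find-or-create walk as when adding words by hand.
    static void Insert(TrieNode root, string word)
    {
        TrieNode current = root;
        foreach (char letter in word)
        {
            TrieNode next = current.Children.Find(c => c.Letter == letter);
            if (next == null)
            {
                next = new TrieNode(letter, false);
                current.Children.Add(next);
            }
            current = next;
        }
        current.IsEndOfWord = true;
    }
}
```

In a web application you would typically do this once at startup, for example with using (var reader = File.OpenText("words.txt")) { root = TrieLoader.Load(reader); }, and keep the resulting root around for all requests.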

To search the trie for matching a rack, the idea is to explore every part of the trie, but to prune out the areas where the rack cannot possibly match. If you haven't got any "A"s on the rack, there is no need to go down any "A" node. I sketched out the search algorithm in your previous question.
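
Here is one possible sketch of that pruned search. It is not necessarily the algorithm from the earlier answer, and it makes several assumptions: the trie holds only upper-case A to Z, a wildcard tile is written as '?', and the RackSearch and FindWords names are invented for this example:

```csharp
using System.Collections.Generic;
using System.Text;

class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children = new List<TrieNode>();

    public TrieNode(char letter, bool isEndOfWord)
    {
        Letter = letter;
        IsEndOfWord = isEndOfWord;
    }
}

static class RackSearch
{
    // Returns every word in the trie that can be spelled from the rack.
    // Assumption: '?' in the rack stands for a wildcard (blank) tile.
    public static List<string> FindWords(TrieNode root, string rack)
    {
        var counts = new int[26]; // how many of each letter remain on the rack
        int wildcards = 0;
        foreach (char tile in rack.ToUpperInvariant())
        {
            if (tile == '?') wildcards++;
            else if (tile >= 'A' && tile <= 'Z') counts[tile - 'A']++;
        }

        var results = new List<string>();
        Explore(root, counts, wildcards, new StringBuilder(), results);
        return results;
    }

    static void Explore(TrieNode node, int[] counts, int wildcards,
                        StringBuilder prefix, List<string> results)
    {
        foreach (TrieNode child in node.Children)
        {
            int index = child.Letter - 'A';
            if (index < 0 || index >= 26) continue; // skip unexpected letters

            // Prefer a real tile; fall back to a wildcard only if none is left.
            bool useTile = counts[index] > 0;
            if (!useTile && wildcards == 0) continue; // prune: rack cannot match

            if (useTile) counts[index]--;
            prefix.Append(child.Letter);

            if (child.IsEndOfWord) results.Add(prefix.ToString());
            Explore(child, counts, useTile ? wildcards : wildcards - 1, prefix, results);

            // Undo the choice before trying the next sibling.
            prefix.Length--;
            if (useTile) counts[index]++;
        }
    }
}
```

Spending a real tile before a wildcard is enough to find every word, because a wildcard left on the rack can still stand in for anything the real tile could have been. With the AA/AB trie from earlier, a rack of "A?" should find both AA and AB, while a rack of "B" finds nothing because the search never gets past the first A node.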

I've got an implementation of a functional-style persistent trie that I've been meaning to blog about for a while but never got around to it. If I do eventually post that I'll update this question.

Septima answered 16/9, 2011 at 15:13 Comment(12)
I'm not clear what the question was, but this is a clear explanation of a trie. +1. Now if you could OCR the Scrabble board through the mobile camera ... - Atwood
Reading the post made me think of Huffman coding. It may be overkill, but since the dictionary is largely fixed, would a Huffman tree be a sensible way to store the data? - Atwood
@Jodrell: A Huffman tree is different, for a different purpose, but I can see why you are reminded of Huffman trees here. - Sextuple
I'm waiting for this blog post ;) - Lazaro
@Carsten: I'm working on it. I have the code working -- that just takes a few minutes -- but I'm trying to figure out what is the most pedagogically interesting and yet still performant way to construct an n-ary tree where > 300K of the tree nodes are 0-ary or 1-ary and one of the nodes is 26-ary. If I can make it immutable and persistent too, that would be great. I've tried an immutable binary vector but that's a bit slow. - Septima
Hey, no problem, take your time. I just wanted to encourage you on it ;) (As I reread it now I can see that this might be interpreted as "OMG where is my article". Have mercy on a non-English guy.) - Lazaro
@Carsten: No worries! Mr Smiley indicated that you were joshing. - Septima
"but I'm trying to figure out what is the most pedagogically interesting and yet still performant way to construct an n-ary tree where > 300K..." Reading that caused me to immediately start trying to figure out a way to do it. My first thought is to just have each node store a string. So, if a node has only one child, it is "compressed" by storing that child as the next character in the string. The price you pay for doing this is that strings are heavier than characters, but this also reduces the number of internal nodes. Not sure if the tradeoff is good. Needs thought/testing. - Sandbag
@Brian: Indeed, that's a reasonably good technique if the lexicon has a lot of words that have "runs" of characters in them that do not have any "special" structure to them. If, say, "TCHSTICK" appears only once in the lexicon, say, at the end of "MATCHSTICK", then a "TCHSTICK" can be a child of the M-A node, without having to allocate nine additional nodes for T-C-H-S-T-I-C-K-$. (And "name a common English word that has TCHST in the middle of it" is a good brain teaser at parties.) - Septima
I was wondering, would there be an advantage to storing the words (or building the trie) in reverse? I have an unsubstantiated idea that word endings are more common than beginnings. - Atwood
@Brian, that is commonly called a patricia trie or radix trie. en.wikipedia.org/wiki/Radix_tree - Shiite
@Jodrell: I think what you want is actually a dawg. That would also be more pedagogically interesting! - Comras
