Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?
Asked Answered
D

8

12

Eg if input string is helloworld I want the output to be like:

do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...

all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example. The input string can be any length, but rarely more than 16 chars.

I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.

Daumier answered 19/5, 2009 at 1:16 Comment(3)
how come "do" is a valid string in the above case? It is not an anagram of any substring ryt?Bookplate
@Bookplate "do" is valid because the letters 'd' and 'o' are in the input string.Daumier
This is similar to the problem of finding substrings that are permutations of a string.Dreyfus
P
14

The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.

Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but it does have children, so you call your tester recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). Check each one-letter entry in tree, if valid make note; if children, re-call tester function appending each of remaining available letters, and so on.

So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.

Parthena answered 19/5, 2009 at 1:38 Comment(5)
I like this explanation. I can (almost) get my head around it. Need to do a pen and paper analysis I think.Daumier
Looks like it's pretty close to Norman Ramsey's DAWG (other post); should have known there was a formal definition for it.Parthena
Nope...I've thought about this and decided I don't understand the tester function. I understand how to build the tree. How does it get from "hello" to "hollow"? Recursion was never my strong point!Daumier
It won't get from "hello" to "hollow" directly. The call that received params of "h" and "elloworld" will subsequently make calls to all child nodes, which will be all 2-letter combinations which start words ("he" & "lloworld"; "ho" & "ellworld"; but not "hl" & anything, because that child will not exist). The call through "he" etc. will get to "hello", & eventually that recursion will exhaust itself. Then from "h" & "elloworld" again you'll call "ho" & "ellworld", and that call will eventually recursively reach "hollow"Parthena
How can this be done using a SQL database? The size of dictionary I want to use can't be handled by the language I'm using (PHP).Ailey
D
9

I couldn't resist my own implementation. It creates a dictionary by sorting all the letters alphabetically, and mapping them to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to attain faster speedups.

The "getAnagrams" command is also an O(n) operation which searches each word in the dictionary to see if it is a subset of the search. Doing getAnagrams("radiotelegraphically")" (a 20 letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.

# Using the 38617 word dictionary at 
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")

def containsLetters(subword, word):
    wordlen = len(word)
    subwordlen = len(subword)

    if subwordlen > wordlen:
        return False

    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True

def getAnagrams(word):
    output = []
    for key in mydict.iterkeys():
        if containsLetters(key, word):
            output.extend(mydict[key])

    output.sort(key=len)
    return output

f = open("dict.txt")
wordlist = f.readlines()
f.close()

mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)

    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]

An example run:

>>> getAnagrams("helloworld")
>>> ['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool','herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
Disarm answered 19/5, 2009 at 7:13 Comment(3)
This certainly gets the job done! But John Pirie's method is possibly more efficient? I'll play around with both.Daumier
@ 20th Century Boy: John Pirie describes some kind of trie, which I already suggested as a possible replacement to a hash table. However, it sounds incomplete. He describes walking the trie with H-e, which gives the word "he", but neglects the permutation "eh" (I think that's a word).Disarm
@Unknown: I think "eh" would get detected after all the "h" words, ie when the function backtracks to the top of the tree and moves onto the branch starting with "e".Daumier
J
6

The data structure you want is called a Directed Acyclic Word Graph (dawg), and it is described by Andrew Appel and Guy Jacobsen in their paper "The World's Fastest Scrabble Program" which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.

I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.

Jaredjarek answered 19/5, 2009 at 2:25 Comment(4)
I found it on the first page of a Google search :-)Daumier
There is perhaps a faster one... see ericsink.com/downloads/faster-scrabble-gordon.pdfTeetotal
thanks @Norman and Steve to point out those amazing resources :)Capricecapricious
The GADDAG is a faster algorithm for generating Scrabble moves, but not necessarily for just finding anagrams/subanagrams. They should be comparable, except that the DAWG might be faster since it's small enough to fit in cache memory.Infatuation
R
2

What you want is an implementation of a power set.

Also look at Eric Lipparts blog, he blogged about this very thing a little while back

EDIT:

Here is an implementation I wrote of getting the powerset from a given string...

private IEnumerable<string> GetPowerSet(string letters)
{
  char[] letterArray = letters.ToCharArray();
  for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
  {
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < letterArray.Length; j++)
    {
      int pos = Convert.ToInt32(Math.Pow(2.0, j));
      if ((pos & i) == pos)
      {
        sb.Append(letterArray[j]);
      }
    }
    yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
  }
}

This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...

Dictionary<string,IEnumerable<string>>

I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)

wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
                let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
                group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
Rudnick answered 19/5, 2009 at 1:26 Comment(0)
O
2

A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:

import itertools
import urllib

def words():
  f = urllib.urlopen(
    'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
  allwords = set(w[:-1] for w in f)
  f.close()
  return allwords

def substrings(s):
  for i in range(2, len(s)+1):
    for p in itertools.permutations(s, i):
      yield ''.join(p)

def main():
  w = words()
  print '%d words' % len(w)
  ss = set(substrings('weep'))
  print '%d substrings' % len(ss)
  good = ss & w
  print '%d good ones' % len(good)
  sgood = sorted(good, key=lambda w:(len(w), w))
  for aword in sgood:
    print aword

main()

will emit:

38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep

Of course, as other responses pointed out, organizing your data purposefully can greatly speed-up your runtime -- although the best data organization for a fast anagram finder could well be different... but that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here -- or millions?). Hash-maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries &c.

Ottoman answered 19/5, 2009 at 1:50 Comment(1)
This works for a small test string, but a 16-character one yields 20922789888000 possible substrings. For long test strings, you want to be bound by valid entries, rather than possible letter combinations.Parthena
E
0

Like Tim J, Eric Lippert's blog posts where the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.

Enthetic answered 19/5, 2009 at 1:41 Comment(0)
H
0

I believe the Ruby code in the answers to this question will also solve your problem.

Hydrogenous answered 19/5, 2009 at 3:28 Comment(0)
G
0

I've been playing a lot of Wordfeud on my phone recently and was curious if I could come up with some code to give me a list of possible words. The following code takes your availble source letters (* for a wildcards) and an array with a master list of allowable words (TWL, SOWPODS, etc) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.

I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.

public IList<string> Matches(string sourceLetters, string [] wordList)
{
    sourceLetters = sourceLetters.ToUpper();

    IList<string> matches = new List<string>();

    foreach (string word in wordList)
    {
        if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
            matches.Add(word);
    }

    return matches;
}


public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
    string builtWord = "";

    foreach (char letter in targetWord)
    {
        int pos = sourceLetters.IndexOf(letter);
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
            continue;
        }


        // check for wildcard
        pos = sourceLetters.IndexOf("*");
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
        }


    }

    return string.Equals(builtWord, targetWord);

}
Galumph answered 26/1, 2011 at 21:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.