Efficient hunting for words in scrambled letters
Asked Answered
S

10

9

I guess you could classify this as a Scrabble style problem, but it started out due to a friend mentioning the UK TV quiz show Countdown. Various rounds in the show involve the contestants being presented a scrambled set of letters and they have to come up with the longest word they can. The one my friend mentioned was "RAEPKWAEN".

In fairly short order I whipped up something in Python to handle this problem, using PyEnchant to handle the dictionary look-ups, however I'm noticing that it really can't scale all that well.

Here's what I have currently:

#!/usr/bin/python

from itertools import permutations
import enchant
from sys import argv

def find_longest(origin):
    s = enchant.Dict("en_US")
    for i in range(len(origin),0,-1):
        print "Checking against words of length %d" % i
        pool = permutations(origin,i)
        for comb in pool:
            word = ''.join(comb)
            if s.check(word):
                return word
    return ""

if (__name__)== '__main__':
    result = find_longest(argv[1])
    print result

That's fine on a 9 letter example like they use in the show, 9 factorial = 362,880 and 8 factorial = 40,320. On that scale even if it would have to check all possible permutations and word lengths it's not that many.

However once you reach 14 characters that's 87,178,291,200 possibly combinations, meaning you're reliant on luck that a 14 character word is quickly found.

With the example word above it's taking my machine about 12 1/2 seconds to find "reawaken". With 14 character scrambled words we could be talking on the scale of 23 days just to check all possible 14 character permutations.

Is there any more efficient way to handle this?

Stature answered 19/1, 2012 at 10:4 Comment(5)
I'm not sure how efficient enchant is. But is it possible to find a list with all words that are n characters long? If so, you could load that list in memory and do a in instead of a enchant.check. I guess it is fast for long words. But the list will be too long for short words.Goa
@Willian, post this comment as an answer in order to vote up your approach: Insteat to catch for permutations in dictionary, check for all dictionary words letters in valid letters.Neolith
Fair point @Willian, I had also written this with python-aspell bindings. It does look like Enchant is a bottleneck, the aspell version takes a lot less time (about half), however it's still going to take way too long brute force! Thanks everyone for your answers, some interesting ideas there. I'll try implementing them and see what sort of speed differences we'll see.Stature
RAEPKWAEN? Where have I seen that collection of letters before? bbc.co.uk/news/entertainment-arts-16627873Procedure
This question has generated some really good responses. +1Phyle
I
6

Implementation of Jeroen Coupé idea from his answer with letters count:

from collections import defaultdict, Counter


def find_longest(origin, known_words):
    return iter_longest(origin, known_words).next()

def iter_longest(origin, known_words, min_length=1):
    origin_map = Counter(origin)
    for i in xrange(len(origin) + 1, min_length - 1, -1):
        for word in known_words[i]:
            if check_same_letters(origin_map, word):
               yield word

def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)

def load_words_from(file_path):
    known_words =  defaultdict(list)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            known_words[len(word)].append(word)
    return known_words

if __name__ == '__main__':
    known_words = load_words_from('words_list.txt')
    origin = 'raepkwaen'
    big_origin = 'raepkwaenaqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnmqwertyuiopasdfghjklzxcvbnm'
    print find_longest(big_origin, known_words)
    print list(iter_longest(origin, known_words, 5))

Output (for my small 58000 words dict):

counterrevolutionaries
['reawaken', 'awaken', 'enwrap', 'weaken', 'weaker', 'apnea', 'arena', 'awake',
 'aware', 'newer', 'paean', 'parka', 'pekan', 'prank', 'prawn', 'preen', 'renew',
 'waken', 'wreak']

Notes:

  • It's simple implementation without optimizations.

  • words_list.txt - can be /usr/share/dict/words on Linux.

UPDATE

In case we need to find word only once, and we have dictionary with words sorted by length, e.g. by this script:

with open('words_list.txt') as f:
    words = f.readlines()
with open('words_by_len.txt', 'w') as f:
    for word in sorted(words, key=lambda w: len(w), reverse=True):
        f.write(word)

We can find longest word without loading full dict to memory:

from collections import Counter
import sys


def check_same_letters(origin_map, word):
    new_map = Counter(word)
    return all(new_map[let] <= origin_map[let] for let in word)

def iter_longest_from_file(origin, file_path, min_length=1):
    origin_map = Counter(origin)
    origin_len = len(origin)
    with open(file_path) as f:
        for line in f:
            word = line.strip()
            if len(word) > origin_len:
                continue
            if len(word) < min_length:
                return
            if check_same_letters(origin_map, word):
                yield word

def find_longest_from_file(origin, file_path):
    return iter_longest_from_file(origin, file_path).next()

if __name__ == '__main__':
    origin = sys.argv[1] if len(sys.argv) > 1 else 'abcdefghijklmnopqrstuvwxyz'
    print find_longest_from_file(origin, 'words_by_len.txt')
Irresolvable answered 19/1, 2012 at 15:50 Comment(0)
C
4

You want to avoid doing the permutation. You could count how many times a character appears in both strings ( the original string and the one from the dictionary). Dismiss all the words from the dictionary where the frequency of characters isn't the same.

So to check one word from the dictionary you will need to count the characters at most MAX (26, n) time.

Cache answered 19/1, 2012 at 10:10 Comment(4)
Yup. This is the way, except for you're looking for greater-or-equal frequency in the original. Given roughly 1,000,000 english words, we're looking at probably ~50-60mb of memory for the dictionary and approximately ~27 million calculations. Nice! More complicated data structures and pre-processing could improve upon this even more.Flashover
@Flashover why put the dictionary in memory? it's a linear search, you can do it by scanning a file.Beset
one simple optimization would be to sort the dictionary file by word length (O(nlogn) but done only once) and start a search at words at most as long as the searched one.Beset
I wrote a program like this once. Easiest thing is to load the entire dictionary into a python dict, where each word is keyed by its sorted letters. That allows you to check for a permuation in O(1): if sorted(word) in dict. But for different lengths, you need to do combinations, using a naive algorithm this is O(2^n) though there may be a better way.Nursery
W
1
  1. Pre-parse the dictionary as sorted(word), word pairs. (e.g. giilnstu, linguist)
  2. Sort the dictionary file.

Then, when you are searching for a given set of letters:

  1. Binary search the dictionary for the letters you have, sorting the letters first.

You'd need to do this separately for each word length.

EDIT: should say that you're searching for all unique combinations of the sorted letters of the target word length (range(len(letters), 0, -1))

Wexford answered 19/1, 2012 at 13:21 Comment(0)
P
1

This is similar to an anagram problem I've worked on before. I solved that by using prime numbers to represent each letter. The product of the letters for each word produces a number. To determine if a given set of input characters are sufficient to make a work, just divide the product of the input character by the product for the number you want to check. If there is no remainder then the input characters are sufficient. I've implemented it below. The output is:

$ python longest.py rasdaddea aosddna raepkwaen
rasdaddea -->  sadder
aosddna -->  soda
raepkwaen -->  reawaken

You can find more details and a thorough explanation of the anagrams case at: http://mostlyhighperformance.blogspot.com/2012/01/generating-anagrams-efficient-and-easy.html

This algorithm takes a small amount of time to set up a dictionary, and then individual checks are as easy as a single division for every word in the dictionary. There may be faster methods that rely on closing off parts of the dictionary if it lacks a letter, but these may end up performing worse if you have large number of input letters so it is actually not able to close off any part of the dictionary.

import sys


def nextprime(x):
    while True:
        x += 1
        for pot_fac in range(2,x):
            if x % pot_fac == 0:
                break
        else:
            return x

def prime_generator():
    '''Returns a generator that produces the next largest prime as
    compared to the one returned from this function the last time
    it was called. The first time it is called it will return 2.'''
    lastprime = 1
    while True:
        lastprime = nextprime(lastprime)
        yield lastprime


# Assign prime numbers to each lower case letter
gen = prime_generator()
primes = dict( [ (chr(x),gen.next()) for x in range(ord('a'),ord('z')+1) ] )


product = lambda x: reduce( lambda m,n: m*n, x, 1 )
make_key = lambda x: product( [ primes[y] for y in x ] )


try:
    words = open('words').readlines()
    words = [ ''.join( [ c for c in x.lower() \
                if ord('a') <= ord(c) <= ord('z') ] ) \
            for x in words ]
    for x in words:
        try:
            make_key(x)
        except:
            print x
            raise

except IOError:
    words = [ 'reawaken','awaken','enwrap','weaken','weaker', ]

words = dict( ( (make_key(x),x,) for x in words ) )


inputs = sys.argv[1:] if sys.argv[1:] else [ 'raepkwaen', ]
for input in inputs:
    input_key = make_key(input)
    results = [ words[x] for x in words if input_key % x == 0 ]

    result = reversed(sorted(results, key=len)).next()
    print input,'--> ',result
Prefatory answered 19/1, 2012 at 16:44 Comment(2)
Im fairly certain your nextprime() function could for pot_fac in xrange(2, x/2), or even math.sqrt(x)Phyle
@Droogans - yes, there are many prime optimizations available. Since I'm only taking 26, I could just list them. I was going for readability rather than speed for the prime initializations. The best list of pot_fac to use would be all the previously returned primes up to floor(sqrt(x)).Prefatory
F
1

I started this last night shortly after you asked the question, but didn't get around to polishing it up until just now. This was my solution, which is basically a modified trie, which I didn't know until today!

class Node(object):
    __slots__ = ('words', 'letter', 'child', 'sib')

    def __init__(self, letter, sib=None):
        self.words = []
        self.letter = letter
        self.child = None
        self.sib = sib

    def get_child(self, letter, create=False):
        child = self.child
        if not child or child.letter > letter:
            if create:
                self.child = Node(letter, child)
                return self.child
            return None
        return child.get_sibling(letter, create)

    def get_sibling(self, letter, create=False):
        node = self
        while node:
            if node.letter == letter:
                return node
            sib = node.sib
            if not sib or sib.letter > letter:
                if create:
                    node.sib = Node(letter, sib)
                    node = node.sib
                    return node
                return None
            node = sib
        return None

    def __repr__(self):
        return '<Node({}){}{}: {}>'.format(chr(self.letter), 'C' if self.child else '', 'S' if self.sib else '', self.words)

def add_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    node = root
    for letter in letters:
        node = node.get_child(letter, True)
    node.words.append(word)

def find_max_word(root, word):
    word = word.lower().strip()
    letters = [ord(c) for c in sorted(word)]
    words = []
    def grab_words(root, letters):
        last = None
        for idx, letter in enumerate(letters):
            if letter == last: # prevents duplication
                continue
            node = root.get_child(letter)
            if node:
                words.extend(node.words)
                grab_words(node, letters[idx+1:])
            last = letter
    grab_words(root, letters)
    return words

root = Node(0)
with open('/path/to/dict/file', 'rt') as f:
    for word in f:
        add_word(root, word)

Testing:

>>> def nonrepeating_words():
...     return find_max_word(root, 'abcdefghijklmnopqrstuvwxyz')
... 
>>> sorted(nonrepeating_words(), key=len)[-10:]
['ambidextrously', 'troublemakings', 'dermatoglyphic', 'hydromagnetics', 'hydropneumatic', 'pyruvaldoxines', 'hyperabductions', 'uncopyrightable', 'dermatoglyphics', 'endolymphaticus']
>>> len(nonrepeating_words())
67590

I think I prefer dermatoglyphics to uncopyrightable for longest word, myself. Performance-wise, utilizing a ~500k word dictionary (from here),

>>> import timeit
>>> timeit.timeit(nonrepeating_words, number=100)
62.8912091255188
>>> 

So, on average, 6/10ths of a second (on my i5-2500) to find all sixty-seven thousand words that contain no repeating letters.

The big differences between this implementation and a trie (which makes it even further from a DAWG in general) is that: words are stored in the trie in relation to their sorted letters. So the word 'dog' is stored under the same path as 'god': d-g-o. The second bit is the the find_max_word algorithm, which makes sure every possible letter combination is visited by continually lopping off its head and re-running the search.

Oh, and just for giggles:

>>> sorted(tree.find_max_word('RAEPKWAEN'), key=len)[-5:]
['wakener', 'rewaken', 'reawake', 'reawaken', 'awakener']
Flashover answered 20/1, 2012 at 10:15 Comment(0)
P
1

Another approach, similar to @market's answer, is to precompute a 'bitmask' for each word in the dictionary. Bit 0 is set if the word contains at least one A, bit 1 is set if it contains at least one B, and so on up to bit 25 for Z.

If you want to search for all words in the dictionary that could be made up from a combination of letters, you start by forming the bitmask for the collection of letters. You can then filter out all of the words that use other letters by checking whether wordBitmask & ~lettersBitMask is zero. If this is zero, the word only uses letters available in the collection, and so could be valid. If this is non-zero, it uses a letter not available in the collection and so is not allowed.

The advantage of this approach is that the bitwise operations are fast. The vast majority of words in the dictionary will use at least one of the 17 or more letters that aren't in the collection given, and you can speedily discount them all. However, for the minority of words that make it through the filter, there is one more check that you still have to make. You still need to check that words aren't using letters more often than they appear in the collection. For example, the word 'weakener' must be disallowed because it has three 'e's, whereas there are only two in the collection of letters RAEPKWAEN. The bitwise approach alone will not filter out this word since each letter in the word appears in the collection.

Procedure answered 20/1, 2012 at 11:5 Comment(0)
O
0
  1. Construct a trie (prefix tree) from your dictionary. You may want to cache it.
  2. Walk on this trie and remove whole branches that do not fit your bag of letters.

At this point, your trie is the representation of all words in your dictionary that can be constructed from your bag of letters.

  1. Just take the longer one(s) :-)

Edit: you may also use a DAGW (Directed Acyclic Word Graph) which will have fewer vertices. Although I haven't read it, this wikipedia article have a link about The World's Fastest Scrabble Program.

Opheliaophelie answered 19/1, 2012 at 13:53 Comment(0)
F
0

When looking for words longer than 10 letters you may try to iterate over words (I think there are not so many words with 10 letters) that are longer than 10 letters and check it you have required letters in your set.

Problem is that you have to find all those len(word) >= 10 words first.

So, what I would do: When reading the dictionary split the words into 2 categories: shorts and longs. You can process shorts by iterating over every possible permutation. Than you can process longs by iterating over then and checking it they are possible.

Of course there are many optimisations possible to both paths.

Fou answered 19/1, 2012 at 14:6 Comment(0)
N
0

DAWG (Directed Acyclic Word Graph) Mark Wutka was kind enough to provide some pascal code here.

Niobe answered 19/1, 2012 at 22:0 Comment(0)
W
0

In case you have a text file with sorted words. Simply this code does the math:

UsrWrd = input()      #here you Enter scrambled letters
with open('words.db','r') as f:
   for Line in f:
       for Word in Line.split():
           if len(Word) == len(UsrWrd) and set(Word) == set(UsrWrd):
               print(Word)
               break
           else:continue       `
Wagstaff answered 29/4, 2018 at 18:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.