scrabble solving with maximum score

Asked 28/4, 2015 at 11:35 Answered 29/4, 2015 at 0:58

algorithm dynamic-programming string-matching backtracking trie

I was asked a question

You are given a list of characters, a score associated with each character and a dictionary of valid words ( say normal English dictionary ). you have to form a word out of the character list such that the score is maximum and the word is valid.

I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?

Leclair answered 28/4, 2015 at 11:35 Comment(4)

If the dictionary is of a reasonable length, I might just brute force it. However I assume thats not the answer you're looking for. – Anagnorisis 28/4, 2015 at 11:40

With preprocessing the dictionary it is relatively obvious and without it ... I can't see a solution (but I may be wrong) – Ascham 28/4, 2015 at 11:44

Yeah am looking for a algo that could work with a quite large dictionary.. what kind of pre-processing with dictionary, can you elaborate? – Leclair 28/4, 2015 at 12:45

Without preprocessing, there is little better that you can do other than scanning the whole dictionary. Is preprocessing of the dictionary and/or the scores allowed ? – Striker 28/4, 2015 at 16:21

First iterate over your letters and count how many times do you have each of the characters in the English alphabet. Store this in a static, say a char array of size 26 where first cell corresponds to a second to b and so on. Name this original array cnt. Now iterate over all words and for each word form a similar array of size 26. For each of the cells in this array check if you have at least as many occurrences in cnt. If that is the case, you can write the word otherwise you can't. If you can write the word you compute its score and maximize the score in a helper variable.

This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).

Adame answered 28/4, 2015 at 11:57 Comment(4)

Since nearly all words are shorter than 26 characters, it will probably be faster not to actually build complete histograms for the dictionary words. Instead, make the histogram cnt for the query word, then make an all-zero histogram h, and for each word in the dict, ++h[c] for each letter c, halting if h[c] > cnt[c]. Either way, zero h back out faster by looping through the letters of the word again, --h[c]ing for each. Separately, it's possible to go faster if many queries will be made on the same dict. – Sokul 28/4, 2015 at 12:37

An additional possible optimization is to break as soon as h[c] is more than cnt[c] instead of iterating over the whole word. However the asymptotic complexity will remain the same even with all these optimizations. – Adame 28/4, 2015 at 12:39

Right, it will remain the same if we assume that alphabet size is a constant. Which is pretty standard, but here I think it's just as reasonable to assume that max word length is bounded by a constant, and in that case your solution is the same complexity as brute force. – Sokul 28/4, 2015 at 12:47

Yeah I gave this solution too but I was supposed to come up with an even better solution – Leclair 28/4, 2015 at 12:48

Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!

First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".

Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...

Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.

import nltk
from string import ascii_lowercase
from itertools import product

scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))

anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}

while True:
    query = input("What do you have?")
    if not query: break

    # make it look like our preprocessed word list
    query = anagram(sanitize(query))

    results = {}

    # all variants for our query
    for mask in product((True, False), repeat=len(query)):
        # get the variant given the mask
        masked = "".join(c for i, c in enumerate(query) if mask[i])
        # check if it's valid
        if masked in anagrams:
            # score it, also getting the word back would be nice
            results[sum(scores[c] for c in masked)] = anagrams[masked]

    print(*max(results.items()))

Megilp answered 29/4, 2015 at 0:37 Comment(0)

Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.

By sorted anagram I mean: if the word is eat you represent it as aet. It the word is tea, you represent it as aet, bubble is represent as bbbelu etc

Since this is scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to maximum check 2^8 possibilities.

For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.

There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.

If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)

Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.

Leilaleilah answered 28/4, 2015 at 14:7 Comment(0)

If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.

Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if the are needed twice. You'll still have to check if a word can be formed in case you have a word like "bubble" and the first test only tells you that you have letters b,u,l,e but not necessarily 3 b's.

By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).

To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.

Coffeng answered 29/4, 2015 at 0:58 Comment(0)

Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.

from random import shuffle
from string import ascii_lowercase
import time

def getValue(word):
    return sum(map( lambda x: key[x], word))

if __name__ == '__main__':
    v = range(26)
    shuffle(v)
    key = dict(zip(list(ascii_lowercase), v))

    with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
        wordDict = f.read().splitlines()
        f.close()

    valued = map(lambda x: (getValue(x), x), wordDict)
    print max(valued)

Here is the dictionary I used, with one hyphenated entry removed for convenience.

Anagnorisis answered 28/4, 2015 at 12:7 Comment(1)

You forgot to check whether you can form the word with the given characters. – Megilp 28/4, 2015 at 13:43

Can we assume that the dictionary is fixed and the score are fixed and that only the letters available will change (as in scrabble) ? Otherwise, I think there is no better than looking up each word of the dictionnary as previously suggested.

So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > .. > A >E >I .. > U.

Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.

Then construct a trie with the words of D' where the children nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, .., the last child one U). On each node of the trie, also store the maximal value of a word going through this node.

Given a set of available characters, you can explore the trie in a depth first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node's value is larger than you current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that start with a one point letter as A is discarded, because it will score at most 8x1 which is less than the value of Z). I bet that you will explore only a very few branches each time.

Ascham answered 28/4, 2015 at 15:7 Comment(3)

Is this really faster than ordering the dictionary by word score. turning your word into a list of possible arrangements and then checking if each dictionary word (in order) is contained in any of the words in your anagram list? o(n)? (excluding ordering of dictionary) – Oleoresin 28/4, 2015 at 16:7

That's a good point and it probably depends on a lot of thinks. I would think your approach works better when you can form many words with your game, mine better when you can form fewer words or when you have a very unbalanced game (in which case you probably stop the exploration very fast). – Ascham 29/4, 2015 at 10:42

I guess in a real scrabble game you will need all possible real words, regardless of score, so that you can check the board for possible plays. In which case a trie must be faster? – Oleoresin 29/4, 2015 at 11:17

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags