Maybe I'm just missing the point of the exercise, but shouldn't a simple heuristic rule be able to cut down a lot of the search? Especially if you are trying to just find a single word that can have the most letters cut, you'd probably just want to look at the largest words and check if they contain any of the smallest ones.
For example, a huge number of words include the letters "a" and "i" which are both valid English words. Moreover, longer words will be increasingly likely to have one or both letters. You can probably skip any word that doesn't have an "a" or "i" immediately.
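As a quick sketch of that prefilter (the helper name and sample words here are mine, just for illustration): since "a" and "i" are the only one-letter English words, any chain of deletions that gets all the way down to a single letter has to end on one of them.

```python
# Sketch of the prefilter: only words containing 'a' or 'i' can ever
# reduce all the way down to a single-letter word.
def has_a_or_i(word):
    return 'a' in word or 'i' in word

sample = ["planets", "strong", "shred", "aid"]
candidates = [w for w in sample if has_a_or_i(w)]
# "strong" and "shred" are skipped immediately
```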
You could probably work this into Greg's solution, actually, if you get your sorted copy of the word list first, i.e.:
# Similar to Greg's. Reads into a dict mapping each word to a cached result
words = dict((x.strip(), None) for x in open("/usr/share/dict/words"))

# Generates a reverse-sorted list from the dict (largest words first)
sortedwords = sorted(words, key=len, reverse=True)

# Largest possible reduction is cutting a longest word down to 1 letter
longestPossible = len(sortedwords[0]) - 1

# Function that recursively finds shorter words and counts removable letters
def getMaxLettersRemoved(w, words):
    # If you've already calculated the value, return it
    if words[w] is not None:
        return words[w]
    # Every string you can make by deleting one letter
    shorterPossibilities = [w[:i] + w[i+1:] for i in range(len(w))]
    # Keep only the ones that are themselves words
    validShorter = [s for s in shorterPossibilities if s in words]
    if validShorter:
        # One letter removed now, plus the best you can do from there
        totalRemoved = 1 + max(getMaxLettersRemoved(s, words) for s in validShorter)
    else:
        # Dead end: no single-letter deletion yields a word
        totalRemoved = 0
    # Cache the result and return it
    words[w] = totalRemoved
    return totalRemoved

# Go through words from largest to smallest, so long as they have 'a' or 'i'
bestWord = None
bestNumRemoved = 0
for w in sortedwords:
    if 'a' in w or 'i' in w:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved
        # Stop if you can't do better
        if bestNumRemoved >= len(w) - 1:
            break

# Need to make sure the best one found beats anything left without 'a' or 'i'
if bestNumRemoved < longestPossible:
    for w in sortedwords:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved
        # Stop if you can't do better: a word without 'a' or 'i' can only
        # shrink to a 2-letter word, so its ceiling is len(w) - 2
        if bestNumRemoved >= len(w) - 2:
            break
So this one differs in a few ways. Firstly, it sorts first so you can get the largest words. Secondly, it completely ignores any word not containing an 'a' or an 'i' on the first pass through. Thirdly, it doesn't need to compute every word or the entire tree in order to produce a result. Instead, it just does a recursive look through the words as they're needed.
Every time it cuts out a letter and finds a new word, it runs the same function to find out how many letters can be cut from that smaller word, adding one for the letter just removed. This should in theory be fairly fast because it won't need to run on most words, since it uses a typical optimization trick: checking whether it has hit an optimal bound. First, it finds the best possibility among the words containing 'i' or 'a'. Then, it checks whether any longer word lacking both letters could still win: such a word can shrink no further than a two-letter word, so it's only worth examining if it's at least two letters longer than the best count already found.
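To make the recursion concrete, here's a toy run on a made-up mini dictionary (the chain and the helper below are just an illustration, mirroring the recursive function in miniature):

```python
# Toy mini dictionary: sprite -> spite -> site -> sit -> it -> i
mini = dict.fromkeys(["sprite", "spite", "site", "sit", "it", "i"])

def max_removed(w, words):
    # Return the cached count if this word was already solved
    if words[w] is not None:
        return words[w]
    shorter = [w[:i] + w[i+1:] for i in range(len(w))]
    valid = [s for s in shorter if s in words]
    if valid:
        # One deletion now, plus the best continuation from any shorter word
        words[w] = 1 + max(max_removed(s, words) for s in valid)
    else:
        words[w] = 0
    return words[w]

print(max_removed("sprite", mini))  # 5: five letters removed, ending at "i"
```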
There are probably improvements on this that could do an even better job of exploiting the regularities of English with a probabilistic algorithm, but I'd suspect this is all right as a deterministic one. Additionally, I don't have a dictionary on hand so I can't actually, uhm, run this, but the concepts are sound.
Additionally, I'm not entirely convinced that sorting the list of keys is worth it. While Python's sort is quite fast, it's still dealing with a big list and could carry a pretty significant cost. An ideal algorithm would probably take that cost into account and decide whether it's worth it (probably not). If you don't sort the list, you'd probably want the first pass to consider only words above a certain minimum length, maybe even as part of a larger loop. There's really no sense in computing anything for words that can't possibly matter to the solution.
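The unsorted variant might look something like this (a sketch only; the cutoff value is an arbitrary assumption you'd tune for your dictionary): one linear scan replaces the O(n log n) sort, and the first pass only bothers with words long enough to plausibly beat the current best.

```python
MIN_LEN = 8  # hypothetical threshold, not tuned

def first_pass_candidates(words, min_len=MIN_LEN):
    # Single pass: no sort, just filter by length and by the 'a'/'i' heuristic
    for w in words:
        if len(w) >= min_len and ('a' in w or 'i' in w):
            yield w
```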
"r+" opens the file for writing, too. Unless your program will actually modify your dictionary, I suggest changing the open mode to just "r". – Nielsen