Python- What word can have the most consecutive letters removed and still be a dictionary-valid word?
Asked Answered
D

3

12

I used this hideous and inefficient implementation to find the word that can have the most consecutive last letters removed and still be a word.

Rodeo, for example, is a well-known one: Rodeo, Rode, Rod, Ro. The program found 'composers': Composers, Composer, Compose, Compos, Comp

I was wondering how I should go about creating a program that finds the longest word that can have ANY of its letters (not just the last ones) removed and it still be considered a word:

For example: beast, best, bet, be -- would be a valid possibility

Here was my program to find the one that removes consecutive letters (I'm also interested in hearing how this can be improved and optimized):

#Recursive function that finds how many letters can be removed from a word and
#it still be valid.  
def wordCheck(word, wordList, counter):

    if len(word)>=1:
        if word in wordList:
            return (wordCheck(word[0:counter-1], wordList, counter-1))
        else:
            return counter
    return counter


def main():
    a = open('C:\\Python32\\megalist2.txt', 'r+')
    wordList = set([line.strip() for line in a])
    #megaList contains a sorted list of tuple of 
    #(the word, how many letters can be removed  consecutively)
    megaList = sorted([(i, len(i)-1- wordCheck(i, wordList, len(i))) for i in wordList], key= lambda megaList: megaList[1])


    for i in megaList:
        if i[1] > 3:
            print (i)

if __name__ == '__main__':
    main()
Drip answered 21/5, 2011 at 22:18 Comment(5)
Note that r+ opens the file for writing, too. Unless your program will actually modify your dictionary, I suggest changing the open mode to just r.Nielsen
For your original problem you could generate a special kind of Radix tree from all dictionary words, the longest word is the longest path in this tree.Splenius
A very similar question has been asked recently so it might be worthwhile to try a few searches. This is generally related to a specialized case of anagrams and #881059 might be a good start.Booking
The longest word that starts with "a", perhaps, sample code: print "antidisestablishmentarianistically" ;)Kirkkirkcaldy
@FMc: No. The point of the project was to find words such that for each letter you remove, it still produces a valid letter. In order for "autosuggestible" to work-- the following would have to be dictionary-valid words: autosuggestibl, autosuggestib, autosuggesti, etc.Drip
C
8

Here's an implementation I just wrote up. It runs in about five seconds with my ~235k word list. The output doesn't show the whole chain, but you can easily reassemble it from the output.

# Load the words into a dictionary
words = dict((x.strip(), set()) for x in open("/usr/share/dict/words"))

# For each word, remove each letter and see if the remaining word is still
# in the dictionary. If so, add it to the set of shorter words associated with
# that word in the dictionary.
# For example, bear -> {ear, bar, ber}
for w in words:
    for i in range(len(w)):
        shorter = w[:i] + w[i+1:]
        if shorter in words:
            words[w].add(shorter)

# Sort the words by length so we process the shortest ones first
sortedwords = sorted(words, key=len)

# For each word, the maximum chain length is:
#  - the maximum of the chain lengths of each shorter word, if any
#  - or 0 if there are no shorter words for this word
# Note that because sortedwords is sorted by length, we will always
# have maxlength[x] already available for each shorter word x
maxlength = {}
for w in sortedwords:
    if words[w]:
        maxlength[w] = 1 + max(maxlength[x] for x in words[w])
    else:
        maxlength[w] = 0

# Print the words in all chains for each of the top 10 words
toshow = sorted(words, key=lambda x: maxlength[x], reverse=True)[:10]
while toshow:
    w = toshow[0]
    print(w, [(x, maxlength[x]) for x in words[w]])
    toshow = toshow[1:] + list(x for x in words[w] if x not in toshow)

The longest word chain in my dictionary is:

  • abranchiate
  • branchiate
  • branchiae
  • branchia
  • branchi
  • branch
  • ranch
  • rach
  • ach
  • ah
  • a
Colicweed answered 21/5, 2011 at 22:45 Comment(3)
Hey, Greg. I appreciate you writing this, but I'm a bit confused as to how it actually works. Do you mind commenting it, or explaining the idea behind it?Drip
@Parseltongue: I've added inline comments. Let me know if there's anything else that isn't clear.Colicweed
This is the dynamic programming approach to the problem. It is mostly equivalent to the other graph search solution I wrote up: the dictionary bear -> {ear, bar ber} represents pointers to memoizable subproblems is equivalent to the nodes W pointing to the W's. I ignored the issue of caching maxlengths, which is a decent optimization.Aeschines
A
10
for each english word W:
    for each letter you can remove:
        remove the letter
        if the result W' is also word:
            draw a line W->W'
for each english word W:
    connect ROOT-> each english word W
use a graph search algorithm to find the longest path starting at ROOT
    (specifically, the words are now in a directed acyclic graph; traverse
    the graph left-right-top-down, i.e. in a "topological sort", keeping
    track of the longest candidate path to reach each node; this achieves 
    the result in linear time)

this algorithm only takes linear O(#wordsInEnglish*averageWordLength) time! basically as long as it takes to read the input

It can easily be modified to find consecutive letters removed: rather than keeping a single candidate per node like (Node('rod').candidate = rodeo->rode->rod), keep a family of candidates per node AND the index of the letter you removed to get each candidate (Node('rod').candidates={rodeo->rod|e->rod|, road->ro|d}). This has the same running time.

Aeschines answered 21/5, 2011 at 22:31 Comment(3)
I assume this is the optimal solution, but I know nothing about graph search algorithms or how to programatically "draw a line" between two letters. Do you have any good references I can read to learn more about this subject? I'm a hobbyist programmer that just likes to pursue fun projects to learn more, but I'd like to start learning about algorithms.Drip
Just keep in mind that you can make it only O(#wordsInEnglish*averageWordLength) amortised with a good collection for the words. Otherwise the time of inserts and lookups to the words will take over.Rama
@Parseltongue: Sorry, I can clarify. I meant: first take an initialized empty graph, add all the words as nodes, and then by "draw a line" I actually mean add an edge Node(W)->Node(W') in the graph. This algorithm might be explained on en.wikipedia.org/wiki/… It's good to be familiar with DAGs (directed acyclic graphs). A good starting point may be some common graph algorithms known as graph-search algorithms, including breadth-first search, depth-first search, and Dijkstra's (shortest path).Aeschines
C
8

Here's an implementation I just wrote up. It runs in about five seconds with my ~235k word list. The output doesn't show the whole chain, but you can easily reassemble it from the output.

# Load the words into a dictionary
words = dict((x.strip(), set()) for x in open("/usr/share/dict/words"))

# For each word, remove each letter and see if the remaining word is still
# in the dictionary. If so, add it to the set of shorter words associated with
# that word in the dictionary.
# For example, bear -> {ear, bar, ber}
for w in words:
    for i in range(len(w)):
        shorter = w[:i] + w[i+1:]
        if shorter in words:
            words[w].add(shorter)

# Sort the words by length so we process the shortest ones first
sortedwords = sorted(words, key=len)

# For each word, the maximum chain length is:
#  - the maximum of the chain lengths of each shorter word, if any
#  - or 0 if there are no shorter words for this word
# Note that because sortedwords is sorted by length, we will always
# have maxlength[x] already available for each shorter word x
maxlength = {}
for w in sortedwords:
    if words[w]:
        maxlength[w] = 1 + max(maxlength[x] for x in words[w])
    else:
        maxlength[w] = 0

# Print the words in all chains for each of the top 10 words
toshow = sorted(words, key=lambda x: maxlength[x], reverse=True)[:10]
while toshow:
    w = toshow[0]
    print(w, [(x, maxlength[x]) for x in words[w]])
    toshow = toshow[1:] + list(x for x in words[w] if x not in toshow)

The longest word chain in my dictionary is:

  • abranchiate
  • branchiate
  • branchiae
  • branchia
  • branchi
  • branch
  • ranch
  • rach
  • ach
  • ah
  • a
Colicweed answered 21/5, 2011 at 22:45 Comment(3)
Hey, Greg. I appreciate you writing this, but I'm a bit confused as to how it actually works. Do you mind commenting it, or explaining the idea behind it?Drip
@Parseltongue: I've added inline comments. Let me know if there's anything else that isn't clear.Colicweed
This is the dynamic programming approach to the problem. It is mostly equivalent to the other graph search solution I wrote up: the dictionary bear -> {ear, bar ber} represents pointers to memoizable subproblems is equivalent to the nodes W pointing to the W's. I ignored the issue of caching maxlengths, which is a decent optimization.Aeschines
S
1

Maybe I'm just missing the point of the exercise, but shouldn't a simple heuristic rule be able to cut down a lot of the search? Especially if you are trying to just find a single word that can have the most letters cut, you'd probably just want to look at the largest words and check if they contain any of the smallest ones.

For example, a huge number of words include the letters "a" and "i" which are both valid English words. Moreover, longer words will be increasingly likely to have one or both letters. You can probably skip any word that doesn't have an "a" or "i" immediately.

You could probably work this into Greg's solution, actually, if you get your sorted copy of the word list first, i.e.:

# Similar to Greg's.  Reads into a dict
words = dict((x.strip(), None) for x in open("/usr/share/dict/words"))
# Generates a reverse sorted list from the dict (largest words first)
sortedwords = sorted(words, key=len, reverse=True)

# Largest possible reduction is making a longest word into 1 letter
longestPossible = len(sortedWords[0])-1

# Function that recursively finds shorter words and keeps count of reductions
def getMaxLettersRemoved(w, words, alreadyRemovedCount=0):
    # If you've already calculated the value, return it
    if words[w] is not None:
        return words[w]
    # Recursively calculate how many letters you can remove
    shorterPossibilities = [w[:i] + w[i+1:] for i in xrange(len(w))]
    # Calculate how max # of letters you can remove from shortened words
    totalRemoved = max([getMaxLettersRemoved(w, words, alreadyRemovedCount+1) for shorter in shorterPossibilities if shorter in words])
    # Total removed will be the same or will increase due to removals from shorter words
    totalRemoved = max(totalRemoved, alreadyRemovedCount)
    # Cache the result and return it
    words[w] = totalRemoved
    return totalRemoved 

# Go through words from largest to smallest, so long as they have 'a' or 'i'
bestNumRemoved = 0
for w in sortedwords:
    if 'a' in w or 'i' in w:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-1:
            break

# Need to make sure the best one found is better than any left
if bestNumRemoved < longestPossible:
    for w in sortedwords:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-2:
            break

So this one differs in a few ways. Firstly, it sorts first so you can get the largest words. Secondly, it completely ignores any word not containing an 'a' or an 'i' on the first pass through. Thirdly, it doesn't need to compute every word or the entire tree in order to produce a result. Instead, it just does a recursive look through the words as they're needed.

Every time it cuts out a letter and finds a new word, it runs the same function to find out the number of letters it can cut out of that smaller word, plus the number already removed from whatever root word it came from. This should in theory be fairly fast because it won't need to run on most words, since it does a typical optimization trick: checking if it's at an optimal bound. First, it finds the best possibility among those with 'i' or 'a'. Then, it checks words longer than the best one found to make sure that there isn't a better option that doesn't contain either letter but is at least 2 letters longer (so it could theoretically be better).

There's probably some improvements on this that could do an even better job taking advantage of the regularities of English using a probablistic algorithm, but I'd suspect this would be all right as a deterministic one. Additionally, I don't have a dictionary on-hand so I can't actually uhm... run this, but the concepts are sound.

Additionally, I'm not entirely convinced that sorting the list of keys is worth it. While the python sort algorithm works quite fast, it's still dealing with a big list and could have a pretty significant cost to it. An ideal algorithm would probably have to take that cost into account and decide if it's worth it or not (probably not). If one is not sorting the list, you'd probably want the first pass to only consider words of a certain minimal length- maybe even as part of a larger loop. There's really no sense in calculating anything to do with words that might have nothing to do with the solution.

Solley answered 22/5, 2011 at 7:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.