How to find possible English words in long random string?

Asked 12/10, 2013 at 19:10 Answered 5/1, 2017 at 5:26

Solved python dictionary information-retrieval trie

I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:

jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf

What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?

What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting TRIE solutions; but most of them are dealing with the situation where you know the set of words in advance.

Softshoe answered 12/10, 2013 at 19:10 Comment(6)

Well, once you pick a dictionary, you do "know the set of words in advance": they're the set of words in the dictionary you pick. – Inhumation 12/10, 2013 at 19:17

Start with the first character. Is it a word? Add the next character. Is it a word? Is there a word that starts with those characters? NO - remove the first character from the big string and start over. YES - Is it a word? Is there a word that starts with those characters? NO - remove the first character from the big string and start over. YES - Is it a word? ........... – Uplift 12/10, 2013 at 19:50

Or maybe a better idea- is the first dictionary word in the string? Is the second dictionary word in the string? Is the third dictionary word in the string? ... Is the nth dictionary word in the string? – Uplift 12/10, 2013 at 19:54

You can use a binary/bisection search to search through the wordlist to see if a group of letters is a word or a prefix. – Uplift 12/10, 2013 at 20:1

I'm not sure that it's worth worrying about efficiency; a quick brute-force search took only 0.6 seconds to find ~1700 words from a 98000-word dictionary in a 28000-character string. – Thumbstall 12/10, 2013 at 20:14

(1) Jam your dictionary into a trie (2) Iterate over your data and see if you can match a trie entry against the current start position. – Trochal 12/10, 2013 at 20:59

I used this solution to find all words forwards and backwards from a corpus of 28,000 random characters in a dictionary of 100,000 words in .5 seconds. It runs in O(n) time. It takes a file called "words.txt" which is a dictionary that has words separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words but I'm sure you can find plenty of text file dictionaries online if not that one.

from random import choice
import string

dictionary = set(open('words.txt','r').read().lower().split())
max_len = max(map(len, dictionary)) #longest word in the set of words

text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
text += '-'+text[::-1] #append the reverse of the text to itself

words_found = set() #set of words found, starts empty
for i in xrange(len(text)): #for each possible starting position in the corpus
    chunk = text[i:i+max_len+1] #chunk that is the size of the longest word
    for j in xrange(1,len(chunk)+1): #loop to check each possible subchunk
        word = chunk[:j] #subchunk
        if word in dictionary: #constant time hash lookup if it's in dictionary
            words_found.add(word) #add to set of words

print words_found

Workingwoman answered 12/10, 2013 at 20:15 Comment(6)

One very minor point: text += text[::-1] could introduce a problem, because it's possible that "red" only exists at the end of text, and after adding the reverse you have "redder" in the centre, which isn't a word in the original string. – Thumbstall 12/10, 2013 at 20:20

I was aware of this but didn't think it a big enough deal to mention. I should have known SO would call me out ;) I edited it to fix the problem. – Workingwoman 12/10, 2013 at 20:23

Clearly you have for i in range, for j in range .... how could this run on O(n) time? – Ducktail 9/2, 2016 at 0:17

j is bounded by max_len, which is a constant (max_len is the length of the longest word in the dictionary). – Workingwoman 10/2, 2016 at 1:30

I think I have this working partially. I am using the test string 'abzcatrwxadfsa' and getting the results of >>> print (words_found) {'sava', 'saya', 'saka', 'alab', 'sara', 'ahab', 'saha', 'safa', 'sana', 'sala', 'sapa', 'sasa', 'arab', 'saga', 'sada', 'saba', 'sawa', 'anab', 'sama', 'akab', 'zcat'} The results contain letters not even in the string. What could be going wrong? – Faze 24/1, 2021 at 10:9

@Faze When I run the code with your string substituted instead of the 28,000 random characters, I get the following result:

set(['ab', 'ad', 'ca', 'as', 'at', 'ax', 'ta', 'ba', 'da', 'a', 'c', 'b', 'd', 'f', 'cat', 's', 'r', 't', 'w', 'x', 'sa', 'z'])

. Are you sure you are replacing the code that generates 28K random characters? That is, I change line 7 to text = "abzcatrwxadfsa". – Workingwoman 29/1, 2021 at 2:16

Here is a bisection/binary search that should be usefull.

def isaprefix(frag, wordlist, first, last):
    """
    Recursive binary search of wordlist for words that start with frag.

    assumes wordlist is a sorted list
    typically called with first = 0 and last = len(wordlist)

    first,last -->> integer
    returns bool
    """

    # base case - down to two elements
    if (last - first) < 2:
        # return False unless frag is a prefix
        # of either of the two remaining words
        return wordlist[first].startswith(frag) or wordlist[last].startswith(frag)

    #mid = (first + last)/2
    midword = wordlist[(first + last) / 2]

    # go ahead and return if you find one
    # a second base case?
    if midword.startswith(frag):
        return True

    #print word, ' - ', wordlist[mid], ' - ', wordlist[mid][:len(word)], ' - ', isprefix
    # start the tests
    # python does just fine comparing strings
    if frag < midword:
        # set the limits to the lower half
        # of the previous range searched and recurse
        return isaprefix(frag, wordlist, first, mid-1)

    # frag is > midword: set the limits to the upper half
    # of the previous range searched and recurse
    return isaprefix(frag, wordlist, mid+1, last)

Uplift answered 12/10, 2013 at 20:10 Comment(0)

You can think of creating a sequence out of the entire dictionary and then aligning them to get the words in the sequence using smith water man or any heuristic local alignment algorithm

Newsom answered 5/1, 2017 at 5:26 Comment(0)

Recommended topics

Hot tags