How to find possible English words in long random string?
Asked Answered
S

3

6

I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:

jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf

What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?

What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting TRIE solutions; but most of them are dealing with the situation where you know the set of words in advance.

Softshoe answered 12/10, 2013 at 19:10 Comment(6)
Well, once you pick a dictionary, you do "know the set of words in advance": they're the set of words in the dictionary you pick.Inhumation
Start with the first character. Is it a word? Add the next character. Is it a word? Is there a word that starts with those characters? NO - remove the first character from the big string and start over. YES - Is it a word? Is there a word that starts with those characters? NO - remove the first character from the big string and start over. YES - Is it a word? ...........Uplift
Or maybe a better idea- is the first dictionary word in the string? Is the second dictionary word in the string? Is the third dictionary word in the string? ... Is the nth dictionary word in the string?Uplift
You can use a binary/bisection search to search through the wordlist to see if a group of letters is a word or a prefix.Uplift
I'm not sure that it's worth worrying about efficiency; a quick brute-force search took only 0.6 seconds to find ~1700 words from a 98000-word dictionary in a 28000-character string.Thumbstall
(1) Jam your dictionary into a trie (2) Iterate over your data and see if you can match a trie entry against the current start position.Trochal
W
10

I used this solution to find all words forwards and backwards from a corpus of 28,000 random characters in a dictionary of 100,000 words in .5 seconds. It runs in O(n) time. It takes a file called "words.txt" which is a dictionary that has words separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words but I'm sure you can find plenty of text file dictionaries online if not that one.

from random import choice
import string

dictionary = set(open('words.txt','r').read().lower().split())
max_len = max(map(len, dictionary)) #longest word in the set of words

text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
text += '-'+text[::-1] #append the reverse of the text to itself

words_found = set() #set of words found, starts empty
for i in xrange(len(text)): #for each possible starting position in the corpus
    chunk = text[i:i+max_len+1] #chunk that is the size of the longest word
    for j in xrange(1,len(chunk)+1): #loop to check each possible subchunk
        word = chunk[:j] #subchunk
        if word in dictionary: #constant time hash lookup if it's in dictionary
            words_found.add(word) #add to set of words

print words_found
Workingwoman answered 12/10, 2013 at 20:15 Comment(6)
One very minor point: text += text[::-1] could introduce a problem, because it's possible that "red" only exists at the end of text, and after adding the reverse you have "redder" in the centre, which isn't a word in the original string.Thumbstall
I was aware of this but didn't think it a big enough deal to mention. I should have known SO would call me out ;) I edited it to fix the problem.Workingwoman
Clearly you have for i in range, for j in range .... how could this run on O(n) time?Ducktail
j is bounded by max_len, which is a constant (max_len is the length of the longest word in the dictionary).Workingwoman
I think I have this working partially. I am using the test string 'abzcatrwxadfsa' and getting the results of >>> print (words_found) {'sava', 'saya', 'saka', 'alab', 'sara', 'ahab', 'saha', 'safa', 'sana', 'sala', 'sapa', 'sasa', 'arab', 'saga', 'sada', 'saba', 'sawa', 'anab', 'sama', 'akab', 'zcat'} The results contain letters not even in the string. What could be going wrong?Faze
@Faze When I run the code with your string substituted instead of the 28,000 random characters, I get the following result: set(['ab', 'ad', 'ca', 'as', 'at', 'ax', 'ta', 'ba', 'da', 'a', 'c', 'b', 'd', 'f', 'cat', 's', 'r', 't', 'w', 'x', 'sa', 'z']). Are you sure you are replacing the code that generates 28K random characters? That is, I change line 7 to text = "abzcatrwxadfsa".Workingwoman
U
1

Here is a bisection/binary search that should be usefull.

def isaprefix(frag, wordlist, first, last):
    """
    Recursive binary search of wordlist for words that start with frag.

    assumes wordlist is a sorted list
    typically called with first = 0 and last = len(wordlist)

    first,last -->> integer
    returns bool
    """

    # base case - down to two elements
    if (last - first) < 2:
        # return False unless frag is a prefix
        # of either of the two remaining words
        return wordlist[first].startswith(frag) or wordlist[last].startswith(frag)

    #mid = (first + last)/2
    midword = wordlist[(first + last) / 2]

    # go ahead and return if you find one
    # a second base case?
    if midword.startswith(frag):
        return True

    #print word, ' - ', wordlist[mid], ' - ', wordlist[mid][:len(word)], ' - ', isprefix
    # start the tests
    # python does just fine comparing strings
    if frag < midword:
        # set the limits to the lower half
        # of the previous range searched and recurse
        return isaprefix(frag, wordlist, first, mid-1)

    # frag is > midword: set the limits to the upper half
    # of the previous range searched and recurse
    return isaprefix(frag, wordlist, mid+1, last)
Uplift answered 12/10, 2013 at 20:10 Comment(0)
N
0

You can think of creating a sequence out of the entire dictionary and then aligning them to get the words in the sequence using smith water man or any heuristic local alignment algorithm

Newsom answered 5/1, 2017 at 5:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.