Checking if a string contains an English sentence

Right now, my approach is to take a dictionary file and iterate through the entire thing. Every time I see a newline, I make a string containing the text from that newline to the next one, then I call string.find() to see if that English word is somewhere in the string being checked. This takes a VERY long time: each word takes about 1/4 to 1/2 a second to verify.

It works correctly, but I need to check thousands of words a second. I can run several windows in parallel (multithreading), which doesn't affect the speed, but it still only checks about 10 a second, and I need thousands.

I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed things up a lot, but still won't reach the speed I need. There has to be a better way to do this.

The strings I'm checking will look like this:

"hithisisastringthatmustbechecked"

but most of them contain complete garbage, just random letters.

I can't simply reject strings containing impossible combinations of letters, because a valid string like the one above would be thrown out for the 'tm' between 'that' and 'must'.

Hera answered 2/8, 2013 at 17:41 Comment(18)
Is it not possible for words to be separated by spaces? Must you verify that all letters make words, or is it sufficient that at least one English word is detected? Do you order words by frequency of use and start with the most common?Depoliti
What are you actually trying to accomplish? Are there ever spaces in the strings? Do you need this to be fully accurate or is a probabilistic guess good? Are the garbage lines random characters or what?Bassist
I would first generate a cache of frequently referenced NON-words -- prefixes of maybe 4-6 characters that are never valid. Several ways to do this.Ichabod
I can see a few ways to do it, but I would think the magic sauce would be in the dictionary data structure. Maybe store the dictionary as a tree with each node being a letter, so all the paths to leaves yield a full word (so h-a-t<-h-a<-h). Then iterate over the line once (from start to end) to see if the beginning of that sub-string is a word. Bonus points if the nodes know the longest path to a word for early search exit.Masticate
E.g., let's say that "stbe" is not the start of any valid word. When you get to "stbe" in the above string you check your cache, see it's not valid, and quickly move on.Ichabod
@HotLicks The string he posted is one he considers validUnmarked
And, for the dictionary of valid words, @MadScienceDreams is effectively describing a radix tree, a good way to express textual choices, though a bit expensive to create and maintain.Ichabod
@NeilKirk The string can't have any spaces, and once one English word is detected I'm good to go.Hera
@HotLicks your method would eliminate "firstbed" which contains two English words...Masticate
@Unmarked - I understand. But "stbe" is not, in itself, the start of a valid word (I don't think).Ichabod
@HotLicks Like I said with the 'tm' example, it won't work because there are no spaces in between the wordsHera
@HotLicks Ah, I see, was misinterpreting what you meant by "move on"Unmarked
@MadScienceDreams - Would not, since you would have already recognized "first".Ichabod
Nicholas - (Thing is, if a single English word is sufficient to pass, any string containing "i" or "a" will pass.)Ichabod
@HotLicks "rstbed" than, though I think we've already covered this, so I'll stop whining :-/Masticate
(With my above suggestion one presumes you are scanning from the front, so if you have "firstbed" then "first" would have been recognized as valid before "stbe" was recognized as invalid. Ie, "firs" would not have been found in the list of invalid word starts.)Ichabod
@HotLicks One and two letter words are not checked; the string is assumed to have at least one three-letter word. (It's hard to communicate with only 1 and 2 letter words, so a string made only of those couldn't be important)Hera
I go to AA - A sentence with meaning and all two letter words. Took a while until I thought of an airline.Jaws

There are a lot of strategies for doing this quickly.

Idea 1

Take the string you are searching and make a copy of the substring beginning at each column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring, i.e., the one starting at the earlier occurrence.)

So the array looks like this:

a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth

Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
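
In Python, a minimal sketch of that indexing scheme might look like this (the function names and the toy two-word dictionary are mine, purely for illustration):

def build_index(s):
    """Map each letter to the substring starting at its first occurrence."""
    index = {}
    for i, ch in enumerate(s):
        if ch not in index:              # first occurrence wins: longest substring
            index[ch] = s[i:]
    return index

def find_any_word(s, dictionary):
    index = build_index(s)
    for word in dictionary:
        sub = index.get(word[0])         # words whose first letter is absent skip the search
        if sub is not None and word in sub:
            return word
    return None

print(find_any_word("hithisisastringthatmustbechecked", ["zebra", "must"]))
# -> "must" ("zebra" is skipped entirely: there is no 'z' in the string)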

Idea 2

Expand upon that idea by noting the length of the longest word in the dictionary and dropping, from each stored string, any letters more than that distance past the final occurrence of the indexing letter, since no match starting at that letter can reach them.

So you have this in the array:

a - substr[0] = "astringthatmustbechecked"

But if the longest word in the list is 5 letters, there is no need to keep any more than:

a - substr[0] = "astri"

If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.

e - substr[4] = "echecked"

You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
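
A small sketch of the trimming step, assuming the rule is "keep letters up to the maximum word length past the letter's last occurrence" (names are illustrative):

def build_trimmed_index(s, max_word_len):
    first, last = {}, {}
    for i, ch in enumerate(s):
        first.setdefault(ch, i)          # substring starts at the first occurrence
        last[ch] = i                     # no match extends past last occurrence + max_word_len
    return {ch: s[first[ch]:last[ch] + max_word_len] for ch in first}

index = build_trimmed_index("hithisisastringthatmustbechecked", 5)
print(index["e"])   # -> "echecked" (repeated 'e's force keeping the whole tail)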

Idea 3

This has nothing to do with ideas 1 and 2. It's an approach you could use instead.

You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.

Assume these are the words in the dictionary:

arun
bob
bill
billy
body
jose

Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)

a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *

The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.

The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".

The regular expression looks something like: /(arun)|(b(ill(y?)|o(b|dy)))|(jose)/. This gives the gist of creating it as a regex.

Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives; if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move on to the next column.

This is a lot of work but can, sometimes, be handy.
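
A rough Python sketch of this idea, using a dict-of-dicts trie with "$" standing in for the star/asterisk end marker (a sketch, not a tuned implementation):

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                 # the star/asterisk in the diagram
    return root

def matches_at(s, start, trie):
    """Walk the trie from column `start`; True once any end marker is reached."""
    node = trie
    for ch in s[start:]:
        if "$" in node:
            return True
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

def any_word_in(s, trie):
    return any(matches_at(s, i, trie) for i in range(len(s)))

trie = build_trie(["arun", "bob", "bill", "billy", "body", "jose"])
print(any_word_in("xxbodyxx", trie))     # -> True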

Side note: I built one of these some time back by writing a program that wrote the code to run the algorithm directly, instead of having code that walks the binary tree data structure.

Think of each set of vertical-bar options as a switch statement on a particular character column, and each arrow as a level of nesting. If there is only one option, you don't need a full switch statement, just an if.
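
As a rough illustration (not the original generated code), the matcher for the sample dictionary might come out as nested ifs like this in Python:

def word_starts_at(s, i):
    if s[i] == "a":
        return s[i:i+4] == "arun"
    if s[i] == "b":
        if s[i+1:i+2] == "i":
            return s[i+2:i+4] == "ll"            # "bill"; "billy" starts the same way
        if s[i+1:i+2] == "o":
            return s[i+2:i+3] == "b" or s[i+2:i+4] == "dy"   # "bob" / "body"
        return False
    if s[i] == "j":
        return s[i:i+4] == "jose"
    return False

print(word_starts_at("xxbodyxx", 2))             # -> True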

That was some fast character matching and really handy for some reason that eludes me today.

Jaws answered 2/8, 2013 at 17:56 Comment(4)
This is a great idea! I will try to implement it and if it gives me the desired speed, I'll accept the answer. :) The list contains every English word, and the strings I'm checking aren't that large, so Idea 2 won't work.Hera
That would be useful if the list was small, but almost every combination exists when checking the entire dictionary.Hera
But remember, with idea 2, you only search any particular string from the array with words starting with a particular letter. So the number of combinations doesn't matter. What matters is that the word can only match starting in places where there is that particular letter.Jaws
I have no memory of what I was doing 5 years ago when writing this, but I just happen to be digging through old posts. Funny how little I knew 5 years ago, but yes the solution is trivial with a Trie. Accepted due to "Idea 3".Hera

You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.

Go through every dictionary word and build a search table for it; you need to do this only once. Your searches for individual words will then proceed at a faster pace, because the "false starts" are eliminated.
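
A sketch of what that could look like in Python, with each word's failure table built once and reused across all the strings being checked (names are illustrative):

def build_table(word):
    """Standard KMP failure table: longest proper prefix that is also a suffix."""
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, word, table):
    k = 0
    for ch in text:
        while k and ch != word[k]:
            k = table[k - 1]             # fall back without rescanning the text
        if ch == word[k]:
            k += 1
        if k == len(word):
            return True
    return False

words = ["must", "zebra"]
tables = {w: build_table(w) for w in words}      # built once, reused per string
s = "hithisisastringthatmustbechecked"
print(any(kmp_find(s, w, tables[w]) for w in words))   # -> True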

Needs answered 2/8, 2013 at 17:50 Comment(3)
Do you know OP's string.find() doesn't already use this algorithm?Jaws
@LeeMeador Even if it does, it couldn't reuse the pre-built search table, and would be forced to rebuild it every time through the find function.Needs
I see. I reread and noticed there are multiple "to be searched" strings. Unless they are pretty long, I'd be loath to replace a built-in search function. Those are probably well optimized at a different level than the one where the KMP algorithm applies.Jaws

How about a Bloom Filter?

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e., a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.

The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
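
A toy Python sketch of that flow (the filter size, hash scheme, and three-word dictionary are illustrative assumptions, not a tuned implementation):

import hashlib

class BloomFilter:
    """Toy Bloom filter; sizing and hashing here are illustrative only."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.k):           # derive k bit positions from salted SHA-256
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):        # False is definite; True may be a false positive
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

words = {"must", "string", "checked"}     # stand-in for the real dictionary
bf = BloomFilter()
for w in words:
    bf.add(w)

s = "hithisisastringthatmustbechecked"
hits = [s[i:j] for i in range(len(s)) for j in range(i + 3, len(s) + 1)
        if bf.might_contain(s[i:j]) and s[i:j] in words]   # confirm against the real set
print(hits)   # -> ['string', 'must', 'checked']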

It is my understanding that some spell-checkers rely on Bloom filters to quickly test whether your latest word belongs to the dictionary of known words.

Varhol answered 2/8, 2013 at 18:17 Comment(3)
How do you use this? Extract every valid substring of the string OP is searching and pass it through the Bloom filter? That seems like a lot of work to extract all those substrings, even using the fact that you only care about ones at least 3 letters long and no longer than the longest word in the dictionary.Jaws
Yes - tokenizing the parent string into candidate substrings will be time consuming. The Bloom filter is simply an idea to rapidly test whether a substring is not in the dictionary (in which case we rapidly move on).Varhol
I like the thought that it turns the search around. Instead of applying all words against the "to be searched" string's substrings, it applies all the substrings against a representation of the list of words to look for.Jaws

This code was modified from How to split text without spaces into list of words?:

from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to score how plausibly a string without
    spaces splits into dictionary words (lower cost = more English-like)."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack through the best matches, summing their costs into a total score.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k

    return costsum

Using the same dictionary as that answer, testing your string outputs:

>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056

The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word at all, it returns inf, since it would have to split everything into single-letter words).

Doti answered 18/6, 2014 at 23:41 Comment(0)

In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?

The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 or so books by different authors should be enough for very accurate classification).

Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So you can make it ignore spaces by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph might look like this:

alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation

It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.

I see that you are concerned about time: training may take a few minutes (assuming you have already compiled gold-standard "sentences" and "random scrambled strings" texts). You only need to train once; you can save the trained model to disk and reload it for subsequent runs in a few seconds. Scoring a string then takes a trivially small number of floating-point multiplications to get a probability, so after you finish training it should be very fast.
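
A rough sketch of a character-bigram version in Python (the training file name and the scoring scheme are assumptions for illustration):

from collections import defaultdict
from math import log

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train(text):
    """Character-bigram transition log-probabilities, spaces stripped first."""
    text = "".join(ch for ch in text.lower() if ch.isalpha())
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():         # add-one smoothing over the alphabet
        total = sum(nxt.values()) + len(ALPHABET)
        probs[a] = {b: log((nxt.get(b, 0) + 1) / total) for b in ALPHABET}
    return probs

def score(s, probs):
    """Average log-probability per transition; higher means more English-like."""
    floor = log(1.0 / (len(ALPHABET) + 1))   # fallback for unseen contexts
    pairs = list(zip(s, s[1:]))
    return sum(probs.get(a, {}).get(b, floor) for a, b in pairs) / max(len(pairs), 1)

model = train(open("alice.txt").read())           # hypothetical training file
print(score("hithisisastringthatmustbechecked", model))   # closer to 0
print(score("xqzvkjwqpzmxnvbqkcrtl", model))               # more negative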

Fullerton answered 13/12, 2013 at 20:12 Comment(0)
