Scoring a string based on how English-like it is

I'm not sure how exactly to word this question, so here's an example:

```
string1 = "THEQUICKBROWNFOX"
string2 = "KLJHQKJBKJBHJBJLSDFD"
```

I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.

In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting that code to C or Python or kludging it into a stand-alone tool, but there must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas Bayes() Python library and trained it with some all-caps strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...

Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.

Thanks for the help!

Nimbus answered 29/7, 2011 at 19:51 Comment(7)
There's a possible drawback to using a "vanilla" Markov model (ngrams), which is that if the plaintext contains something impossible in English (or not in your model) such as "QX" then a Markov model will assign it a zero likelihood. Once you get the likelihood from the Markov model, you can transform it using Bayes' Theorem to estimate the probability that you have successfully decrypted the text. At this point a working statistician would adjust the order of the Markov model to achieve the desired accuracy.Gerardgerardo
@Dietrich The standard approach to this is to increment all counts by one, and assign 'impossible' edges a value of 1 - making them slightly less likely than the least likely transition you actually observed.Roar
@Nick: This seems like it would scale poorly -- with a large enough dictionary, a count of 1 would be practically 0.Gerardgerardo
@Dietrich Correct - reflecting the fact that in that entire corpus, you didn't encounter a single instance of that transition, making it less likely than the least likely transition you did observe.Roar
@Nick: But this means that as the dictionary size increases, the algorithm's tolerance for typos in the input approaches zero -- while the actual probability that there are typos in the input remains the same.Gerardgerardo
@Dietrich Only if your corpus doesn't include typos. Your corpus should resemble the text you want to recognize to the greatest degree possible.Roar
@Nick: But that's not generally practical -- "I am from the planet Qxbxznk" is recognizably English, but it could be impractical to predict that it is in the plaintext.Gerardgerardo
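To make the add-one smoothing suggested in the comments above concrete, here is a minimal Python sketch (the counts structure and the function name are just illustrative, not from any library): every possible transition gets its observed count plus one, so a pair like "QX" becomes merely very unlikely rather than impossible.

```python
def add_one_smoothed(counts, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """counts[a][b] = number of times letter b followed letter a in the corpus.
    Returns transition probabilities in which unseen pairs get a small
    nonzero value rather than zero."""
    probs = {}
    for a in alphabet:
        row = counts.get(a, {})
        total = sum(row.values()) + len(alphabet)  # one extra count per possible successor
        probs[a] = {b: (row.get(b, 0) + 1) / total for b in alphabet}
    return probs
```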

I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.
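A rough sketch of that idea in Python, using a single previous letter as context and add-one smoothing so unseen pairs don't zero out the product (the corpus file name in the usage comment is a placeholder):

```python
import math
from collections import defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def train_bigram_model(corpus):
    """Count letter-to-letter transitions in an uppercase, A-Z-only corpus and
    turn them into add-one-smoothed probabilities (so unseen pairs are rare,
    not impossible)."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(corpus, corpus[1:]):
        counts[prev][cur] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values()) + len(ALPHABET)
        probs[a] = {b: (counts[a][b] + 1) / total for b in ALPHABET}
    return probs

def english_score(text, probs):
    """Average log-probability per transition for an uppercase A-Z string;
    higher means more English-like. Dividing the log by the number of
    transitions is the same normalization as taking the Nth root of the
    product of probabilities."""
    if len(text) < 2:
        return float("-inf")
    logp = sum(math.log(probs[a][b]) for a, b in zip(text, text[1:]))
    return logp / (len(text) - 1)

# Example usage ("corpus.txt" is a placeholder for whatever training text you use):
# raw = open("corpus.txt").read().upper()
# model = train_bigram_model("".join(c for c in raw if "A" <= c <= "Z"))
# english_score("THEQUICKBROWNFOX", model)      # higher
# english_score("KLJHQKJBKJBHJBJLSDFD", model)  # lower
```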

Ejector answered 29/7, 2011 at 20:1 Comment(6)
That sounds like a bigram probability function, in which case I might as well port my trigram code as it is probably more accurate. Also, I chose to build my stats from Huck Finn as it was a large collection of natural, informal language. I fear a dictionary may skew probabilities away from oft-used words like "the" and towards more obscure patterns. It also may not include slang. I should really test that some day...Nimbus
A dictionary would also not have the bigrams and trigrams that span words commonly found next to each other: "if it" with the space removed would have the bigrams "if", "fi" and "it". Taking out spaces changes everything...Nimbus
In other words, you are suggesting to "compute the likelihood function using a Markov model." Typically you don't take the nth root of the result; instead one usually takes the logarithm -- which usually simplifies the math a great deal. Forgive the weasel words... there is really no set way to do this.Gerardgerardo
If you took the logarithm, you'd then divide the result by N to eliminate dependence on the length of the word.Ejector
@R..: I believe that would be statistically unsound. The longer the piece of text is, the more certainty you can have that the text is English.Gerardgerardo
@Dietrich Yes - the probability represents the likelihood of that particular string being generated from the input corpus if you selected edges at (weighted) random. If you want a figure that's independent of length, you'd need to normalize by the probability of generating any particular string of that length - multiplying by |a|^n (where |a| is the alphabet size) would probably be a good approximation.Roar
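For reference, the Nth-root suggestion in the answer and the divide-the-log-by-N suggestion in these comments are the same normalization on different scales, and comparing against a uniform-random baseline just shifts the per-letter log score by the log of the alphabet size:

$$
\sqrt[N]{\prod_{i=1}^{N} p_i} \;=\; \exp\!\Big(\tfrac{1}{N}\sum_{i=1}^{N}\log p_i\Big),
\qquad
\tfrac{1}{N}\log\frac{\prod_{i=1}^{N} p_i}{|a|^{-N}} \;=\; \tfrac{1}{N}\sum_{i=1}^{N}\log p_i \;+\; \log|a|.
$$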

I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance, if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but exactly how to create a score from the probabilities is up to you. You could normalize it for string length, to let you compare short and long strings.

Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary with not only single words, but also short phrases with the spaces removed. Then the scoring would reflect not only how English the separate words are, but how English a series of words is. It's not a perfect system, but it would provide consistent scoring.
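Here is a minimal sketch of the lookup described above, using a two-letter context so the "TH" → "E" example maps directly; the corpus file is a placeholder, and the score is simply the average of the looked-up probabilities, as suggested:

```python
from collections import defaultdict

def build_table(corpus_path):
    """Map each two-letter context to the probability of each following letter.
    The corpus is uppercased and stripped of everything but A-Z, so any phrases
    in it contribute cross-word transitions once their spaces are removed."""
    text = open(corpus_path).read().upper()
    text = "".join(c for c in text if "A" <= c <= "Z")
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    table = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        table[ctx] = {nxt: n / total for nxt, n in row.items()}
    return table

def chain_score(s, table):
    """Average probability of each letter given the two letters before it.
    Unseen contexts or letters contribute zero, which penalizes gibberish."""
    if len(s) < 3:
        return 0.0
    total = sum(table.get(s[i:i + 2], {}).get(s[i + 2], 0.0) for i in range(len(s) - 2))
    return total / (len(s) - 2)
```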

Herzig answered 29/7, 2011 at 20:25 Comment(2)
Removing spaces and punctuation and making every letter capital in an entire book is no problem. But, I was hoping to find a good library as Markov chains seemed popular... Looks like I may be modifying my trigram code instead.Nimbus
Writing Markov code is actually fairly simple. The whole algorithm only took a hundred lines or two in Java. I agree the dictionary would be the big obstacle, but the advantage is that it would allow you to look not only for English-ness, but for anything you can put into a text file to read in and create the data structure.Herzig

I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).

Irenics answered 29/7, 2011 at 20:41 Comment(1)
I found lots of language-guessing tools and stackoverflow posts, but I only have one language and all I care about is how plausible a string is to be actual language versus gibberish.Nimbus

The Index of Coincidence might be of help here; see https://en.wikipedia.org/wiki/Index_of_coincidence.

For a start, just compute the difference between the IC and the expected value of 1.73 for English (the normalized value given in the Wikipedia article above). For more advanced usage you might want to calculate the expected value yourself from a sample corpus of the language.
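A minimal sketch of that starting point (the 1.73 figure is the normalized IC, i.e. the raw index multiplied by the 26-letter alphabet size; uniformly random letters come out near 1.0):

```python
def normalized_ic(text, alphabet_size=26):
    """Normalized index of coincidence: roughly 1.73 for English letters and
    roughly 1.0 for uniformly random letters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    freqs = {}
    for c in letters:
        freqs[c] = freqs.get(c, 0) + 1
    return alphabet_size * sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))

def ic_distance(text, expected=1.73):
    """Smaller is more English-like: absolute distance from the expected value."""
    return abs(normalized_ic(text) - expected)
```

Note that the IC is quite noisy on strings as short as the examples in the question, so it is probably better as a coarse pre-filter than as the only score.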

Maravedi answered 27/5, 2016 at 23:44 Comment(0)

I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."

The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.

Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative, as it was just a highly ranked search result, and I don't really have expertise in this area.)

Fourposter answered 30/7, 2011 at 0:54 Comment(1)
It's fairly easy for a program to tell if an individual word is English; it can look it up in a dictionary (just a text file full of known English words). The tricky thing in the OP's question is that the word boundaries aren't known. Also, from what I could tell by scanning that paper, the algorithm discussed in that paper is always able to produce a guess at a pronunciation for any string of characters, so it doesn't seem to return a "was able to produce a pronunciation" value that could be used as a signal.Doha
