Detecting random keyboard hits considering QWERTY keyboard layout

Asked 27/9, 2010 at 8:41 Answered 21/9, 2015 at 3:53

Solved algorithm n-gram qwerty text-classification

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout".

Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh

Is there any software that does this already (preferably free and open source)?

If not, is there an active FOSS project whose goal is to achieve this?

If not, how would you suggest implementing such software?

Douglassdougy answered 27/9, 2010 at 8:41 Comment(3)

Vandalism detection algorithms already include dictionary/grammar-based detection, so here I am looking for an algorithm that does NOT use dictionaries or grammar, but rather finger patterns. – Douglassdougy 27/9, 2010 at 8:45

And how exactly do 'finger patterns' differ from dictionary entries plus grammar rules? It is the same approach; the distinction is that one is positive detection and the other negative detection. Furthermore, it is not clear what you are asking for: random keyboard hits considering QWERTY are no different than random keyboard hits considering Dvorak, unless they are not really random (maybe better to call them 'commonly used vandalism constructs'). – Hardener 27/9, 2010 at 10:45

@Unreason: About your first question: I meant dictionaries and grammars of existing human languages. The "negative detection" you propose is interesting, feel free to propose it as an answer. About the "Furthermore": let me reformulate my question. You are given a sequence of characters that have been typed on a QWERTY keyboard; how do you calculate the probability that it has been typed carelessly (i.e. by someone whose goal was not to express something but to quickly enter many characters, for instance oiuroiqewrcoqf)? – Douglassdougy 27/9, 2010 at 11:21

If the two letters of a bigram in the analyzed text are close in QWERTY terms but the bigram has near-zero statistical frequency in the English language (like the pairs "fg" or "cd"), then there is a chance that random keyboard hits are involved. The more such pairs are found, the more that chance increases.

If you want to take into account the use of both hands for bashing, then test letters that are separated by another letter for QWERTY closeness, but test the consecutive bigrams (or even trigrams) for frequency. For example, in the text "flsjf" you would check F and S for QWERTY distance, but the bigrams FL and LS (or the trigram FLS) for frequency.
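A minimal sketch of the first heuristic above, in Python. Everything named here is an assumption for illustration: the key coordinates only approximate the usual QWERTY row stagger, the 1.3 distance cutoff and the 0.001 frequency cutoff are arbitrary, and the bigram frequencies come from whatever reference text you pass in rather than from a published English table.

```python
import math
from collections import Counter

# Approximate key positions: (column + row stagger, row index). Assumed layout.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {
    ch: (col + offset, row_idx)
    for row_idx, (row, offset) in enumerate(ROWS)
    for col, ch in enumerate(row)
}

def is_adjacent(a, b, max_dist=1.3):
    """True if two distinct letters are roughly neighbouring keys."""
    if a == b or a not in KEY_POS or b not in KEY_POS:
        return False
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return math.hypot(x1 - x2, y1 - y2) <= max_dist

def bigrams(text):
    """Yield letter bigrams inside each whitespace-separated word."""
    for word in text.lower().split():
        letters = [c for c in word if c in KEY_POS]
        for a, b in zip(letters, letters[1:]):
            yield a + b

def suspicious_bigram_ratio(text, reference_text, max_ref_freq=0.001):
    """Share of bigrams that are adjacent keys yet rare in the reference text."""
    ref = Counter(bigrams(reference_text))
    total_ref = sum(ref.values()) or 1
    found = list(bigrams(text))
    if not found:
        return 0.0
    hits = sum(
        1 for bg in found
        if is_adjacent(bg[0], bg[1]) and ref[bg] / total_ref <= max_ref_freq
    )
    return hits / len(found)
```

In practice the reference text would need to be a large corpus; with a tiny one, almost every bigram looks rare.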

Homes answered 27/9, 2010 at 11:51 Comment(5)

+1 this sounds good, but first the list of these common bigrams for gibberish needs to be extracted; otherwise the end result would be based on guesstimates (guessing which bigrams or trigrams are characteristic of gibberish). – Hardener 27/9, 2010 at 11:57

Maybe for OP it needs to be stated that bigram matching is the common algorithm found in spell checkers – Hardener 27/9, 2010 at 12:0

Accepted. For reference, I would like to add that repetition of an unusual bigram is a quasi-sure sign. – Douglassdougy 4/10, 2010 at 7:42

So, to go back to Nicolas's question: is there any open source lib that implements this type of logic? – Cysteine 14/10, 2013 at 18:17

@Cysteine to that question I'm no smarter than Google – Homes 14/10, 2013 at 23:8

Consider the empirical distribution of two-letter sequences, i.e. "the probability of seeing letter a given that it follows letter b". These probabilities fill a table of size 27x27 (counting space as a letter).

Now compare this table with historical data from a bunch of English/French/whatever texts. Use the Kullback-Leibler divergence for the comparison.
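A rough sketch of that comparison in Python. Two simplifications that are mine, not the answer's: it compares the joint bigram distribution (all 27x27 cells summing to 1) rather than the conditional table, and it uses add-alpha smoothing so that no cell is zero. The function names and the default `alpha` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space = 27 symbols

def normalize(text):
    """Lowercase, turn every other character into a space, collapse spaces."""
    chars = [c if c in ALPHABET else " " for c in text.lower()]
    return " ".join("".join(chars).split())

def bigram_distribution(text, alpha=0.1):
    """Smoothed joint distribution over all 27x27 ordered symbol pairs."""
    s = normalize(text)
    counts = Counter(zip(s, s[1:]))
    total = sum(counts.values()) + alpha * len(ALPHABET) ** 2
    return {
        (a, b): (counts[(a, b)] + alpha) / total
        for a in ALPHABET
        for b in ALPHABET
    }

def kl_divergence(p, q):
    """D(p || q) in nats; both dicts must share the same keys."""
    return sum(pv * math.log(pv / q[key]) for key, pv in p.items())

def divergence_from_reference(text, reference_text):
    """Higher values mean the text's bigrams look less like the reference."""
    return kl_divergence(bigram_distribution(text),
                         bigram_distribution(reference_text))
```

For short inputs the smoothing term dominates, so the score is only meaningful relative to scores of ordinary texts of similar length.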

Walkabout answered 27/9, 2010 at 12:4 Comment(3)

Am I right that to implement your solution I need a corpus of "mashed text" ? – Douglassdougy 27/9, 2010 at 12:30

You need a corpus of standard English text (like Wikipedia articles). – Walkabout 27/9, 2010 at 12:31

I think only considering the last version of the article (unless it's really short) is likely to work for the Wikipedia example. – Variometer 27/9, 2010 at 16:34

In my experience, most keyboard mashing tends to be on the home row. It would be reasonably simple to check whether a high proportion of the characters used are asdfjkl;.
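As a minimal sketch in Python (the function name is made up, and what counts as a "high proportion" is left to the caller):

```python
HOME_KEYS = set("asdfjkl;")  # the resting keys named above

def home_row_ratio(text, keys=HOME_KEYS):
    """Fraction of non-whitespace characters that are home-row resting keys."""
    chars = [c for c in text.lower() if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in keys for c in chars) / len(chars)
```

A threshold would have to be tuned on real data, since ordinary words like "salad" or "flask" also score high.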

Feune answered 27/9, 2010 at 9:18 Comment(1)

wow I never noticed that, but that's so true about my random mashing! – Anticipant 27/9, 2010 at 11:58

Taking an approach based on keyboard layout will provide a good indicator. With a QWERTY layout you will find that around 52% of the letters in any given text are from the top row of letter keys, about 32% are from the middle row, and about 14% are from the bottom row. While this varies slightly from one language to another, there remains a very clear pattern which can be detected.

Use the same methodology to discover the patterns for other keyboard layouts, then make sure you detect the layout used for any entered text before checking for gibberish.

Even though the pattern is clear, it is best to use this method as one indicator only, given that it works best with longer texts. Other indicators, such as non-alphanumeric characters mixed with alphanumeric ones, text length, etc., can be combined with it; with suitable weighting, together they can give a pretty good overall indication of gibberish entry.
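A small Python sketch of the row check. The expected shares are simply the figures quoted in this answer (they are not verified here and do not sum to exactly 100%), and the deviation measure, a sum of absolute differences, is my own choice.

```python
from collections import Counter

ROWS = {"top": "qwertyuiop", "middle": "asdfghjkl", "bottom": "zxcvbnm"}
ROW_OF = {ch: name for name, keys in ROWS.items() for ch in keys}

# Shares quoted in the answer above; treat them as an unverified assumption.
EXPECTED = {"top": 0.52, "middle": 0.32, "bottom": 0.14}

def row_shares(text):
    """Share of letters typed on each QWERTY letter row."""
    counts = Counter(ROW_OF[c] for c in text.lower() if c in ROW_OF)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in ROWS}
    return {name: counts[name] / total for name in ROWS}

def row_deviation(text, expected=EXPECTED):
    """Sum of absolute differences between observed and expected row shares."""
    shares = row_shares(text)
    return sum(abs(shares[name] - expected[name]) for name in ROWS)
```

A large deviation is only a hint, and as noted above it is more trustworthy on longer texts.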

Gwyn answered 21/9, 2015 at 3:53 Comment(0)

Fredley's answer can be extended to a grammar that would construct words from nearby letters.

For example asasasasasdf could be generated with a grammar that connects as, sa, sd and df.

Such a grammar, expanded to all letters on the keyboard (connecting letters that are next to each other), could, after parsing, give you a measure of how much of a text can be generated by this 'gibberish' grammar.

Caveat: of course, any text discussing such a grammar and listing examples of 'gibberish' text would score significantly higher than a regular spell-checked text.

Do note that the example approach would not catch vandalism in the form of 'h4x0r rulezzzzz!!!!!'.
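One possible reading of the adjacency grammar, sketched in Python. Assumptions on my part: only keys next to each other on the same row are connected (as in the as/sa/sd/df example), "parsing" is reduced to finding runs of connected letters inside a word, and runs shorter than `min_run` letters are ignored.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Ordered pairs of keys that sit side by side on the same row.
ADJACENT = set()
for row in ROWS:
    for a, b in zip(row, row[1:]):
        ADJACENT.add(a + b)
        ADJACENT.add(b + a)

def gibberish_coverage(text, min_run=3):
    """Fraction of characters inside runs where each step moves to a neighbouring key."""
    total = 0
    covered = 0
    for word in text.lower().split():
        total += len(word)
        run = 1  # length of the current run of connected characters
        for a, b in zip(word, word[1:]):
            if a + b in ADJACENT:
                run += 1
            else:
                if run >= min_run:
                    covered += run
                run = 1
        if run >= min_run:
            covered += run
    return covered / total if total else 0.0
```

Ordinary words can contain such runs too (for example "ere" in "there"), so the score needs a threshold rather than a zero test.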

Another approach here (which can be integrated with the above method) would be to statistically analyze a corpus of vandalized text and try to get common words in vandalized texts.

EDIT:
Since you are assuming QWERTY, I guess we could assume English, too?

What about KISS: run the text through an English spell checker and, if it fails miserably, conclude that it is probably gibberish. (The question is, why would you want to distinguish quickly typed gibberish from random nonsense or, for that matter, from very badly spelled text?)

Alternatively if other keyboard layouts (Dvorak, anyone?) and languages are to be considered, then maybe run the text through all available language spell checkers and then proceed (this would give language autodetect, too).

This would not be a very efficient method, but it could be used as a baseline test.

Note:
In the long run I imagine that vandals would adapt and start vandalizing with, for example, excerpts from other Wikipedia pages, which would be very hard to detect automatically as vandalism (OK, existing texts could be checksummed and a flag raised on duplicates, but if the text came from some other source it would be very hard).

Hardener answered 27/9, 2010 at 11:54 Comment(1)

About your "Do note" paragraph: Indeed, the 'h4x0r rulezzzzz!!!!!' case is not targeted here, and it is actually taken care of by other means, which the winner's paper talks about. In brief: Character repetition of "zzzzz" and excessive punctuation would already mark it as probable vandalism. – Douglassdougy 27/9, 2010 at 12:1
