Build a natural language model that fixes misspellings

Asked 10/2, 2010 at 12:53 Answered 10/2, 2010 at 13:23

What books cover how to build a natural language parsing program like this:

input: I got to TALL you
output: I got to TELL you

input: Big RAT box
output: Big RED box

in: hoo un thum zend three
out: one thousand three

It must have a language model that can predict which words are misspelled!

What are the best books on how to build such a tool??

p.s. Are there free webservices to spell-check? From Google maybe?..

Freezer answered 10/2, 2010 at 12:53 Comment(6)

+1 for misspelling "misspelling". That was a joke, right? – Upright 10/2, 2010 at 13:0

@Upright hahaha kinda. Fast typing, but it demonstrates how such tools can be useful. – Freezer 10/2, 2010 at 13:6

+1, Use Spelly in Google Wave! :P – Heteroousian 10/2, 2010 at 13:11

Your first two examples were GRAMMAR corrections, not spelling corrections. For example, if I had a box I kept rats in, it would be a RAT box. If it were big, it would be a Big RAT box. In the first case an adjective took the place of a verb (relatively easy to detect). In the second case, as per my example, the word RAT (typically a noun) can also be an adjective. So the grammar for that sentence fragment is fine (assuming you don't care that it's not a sentence). Best of luck to you! – Reglet 10/2, 2010 at 13:14

@Jason D I need grammar corrections AS WELL. I gotta be sure that the context will tell which word is the most probable in that position. – Freezer 10/2, 2010 at 13:46

From my second example, I certainly hope you understand that it is impossible to discern the INTENT of the typer from what's typed. I would suggest a lot of user interaction to train it on a per-person (author) basis. If I were to be speaking of a red box in the (mythical) prior sentence, then likely calling out a Big RAT box would raise some eyebrows among most readers. However if I were speaking merely of a box, and I alluded to keeping rodents in it, then most readers would assume a RAT BOX is CORRECT. Hence the role of a higher order of CONTEXT which NLP has not yet achieved. – Reglet 11/2, 2010 at 15:19

Peter Norvig has written a terrific spell checker. Maybe that can help you.
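
For readers who want a feel for the idea, here is a minimal sketch in the spirit of that approach, not the original script: generate every string one edit away from the input and pick the candidate seen most often in a corpus. The toy `CORPUS` and the function names are assumptions made up for illustration; a real corrector would count words in a large body of text.

```python
import re
from collections import Counter

# Toy corpus (hypothetical); a real system would use a large text collection.
CORPUS = "i got to tell you about the spelling of one thousand three in the big red box"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that occur in the corpus."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Return the word itself if known, else the most frequent known
    word one edit away, else the word unchanged."""
    word = word.lower()
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])


print(correct("speling"))   # a known word one insert away exists in the toy corpus
print(correct("thosand"))
```

Note that a word-by-word corrector like this leaves any known word alone, so with a realistic vocabulary it would not touch "tall" in "I got to TALL you"; that case needs context.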

Gambeson answered 10/2, 2010 at 12:58 Comment(3)

Just was going to link it :-) +1 – Lorrin 10/2, 2010 at 12:58

Cool script. Seems like it would be straightforward to extend it to word bigrams or trigrams if you had a corpus of correct text in the language of choice. – Kalimantan 10/2, 2010 at 13:10

Exactly, that's the script that I tried to remember in my post below. +1 – Zurheide 10/2, 2010 at 14:2

You have at least three options

You can write a program which understands the language (i.e. what a word means). This is a topic for research today. Expect the first results when you can buy a computer which is fast enough to run such a program (which is probably in 10 years when computers have become 1000 times faster than today).
Use a huge corpus (text documents) to train a Hidden Markov Model.
Use a huge corpus and generate statistics about ~~quadruplets~~ n-grams, i.e. how often a tuple of N words appears. I don't have a link handy for this but the idea is that some words always appear in the context of other words. So when you parse your text into 4-grams and look them up in your database and you can't find one, chances are that there is something wrong with the current tuple. The next step is to find all possible matches (other 4-grams which have a small soundex or similar distance to the current one) and try the one with the highest frequency.
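
A rough sketch of the n-gram idea, under simplifying assumptions: it uses bigrams instead of 4-grams, a hand-made toy corpus instead of real statistics, and single-character edits instead of a soundex distance to find similar words. All names here (`CORPUS`, `correct_sentence`, and so on) are hypothetical.

```python
from collections import Counter

# Toy corpus (hypothetical); real counts would come from a huge text collection.
CORPUS = ("i got to tell you a story . the big red box is tall . "
          "you have to tell me . a big red box . the rat is in the box .")
tokens = CORPUS.split()
VOCAB = set(tokens)
BIGRAMS = Counter(zip(tokens, tokens[1:]))  # how often each word pair occurs


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """The word itself plus similar words, restricted to the vocabulary."""
    return {w for w in edits1(word) | {word} if w in VOCAB} or {word}


def context_score(prev, word, nxt):
    """How often word was seen right after prev plus right before nxt."""
    return BIGRAMS[(prev, word)] + BIGRAMS[(word, nxt)]


def correct_sentence(sentence):
    words = sentence.lower().split()
    padded = ["."] + words + ["."]  # "." stands in for a sentence boundary
    out = []
    for i in range(1, len(padded) - 1):
        prev, word, nxt = padded[i - 1], padded[i], padded[i + 1]
        # Highest context score wins; ties go to the word as typed.
        best = max(sorted(candidates(word)),
                   key=lambda c: (context_score(prev, c, nxt), c == word))
        out.append(best)
    return " ".join(out)


print(correct_sentence("I got to TALL you"))
```

Because "to tell" and "tell you" occur in the toy corpus while "to tall" and "tall you" do not, "tall" is replaced even though it is a valid word. "Big RAT box" is left alone by this sketch, since "red" is two edits away from "rat" and never becomes a candidate; a looser similarity measure would be needed for that case.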

Google has this data for quite a few languages and you might find more in Google labs about this.

[EDIT] After some googling, I finally found the link: On this page, you can buy English 1- to 5-grams which Google collected over the whole Internet on 6 DVDs.

Googling for "google spelling statistics n-grams" will also turn up some interesting links.

Reed answered 10/2, 2010 at 13:7 Comment(2)

Will Google share this data with me? ;) – Freezer 10/2, 2010 at 13:11

I think so. I must really find the link again. – Reed 10/2, 2010 at 13:41

soundex (wiki) is one option
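
For illustration, a small sketch of the common American Soundex variant (first letter plus three digits); implementations differ in details, so treat this as one possible version rather than the exact behaviour of any particular library.

```python
# Map consonants to Soundex digits; vowels, h, w and y have no code.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("tall"), soundex("tell"))
print(soundex("rat"), soundex("red"))
print(soundex("fenetiklee"), soundex("phonetically"))
```

"tall"/"tell" and "rat"/"red" each share a code, so Soundex can propose the candidates from the question, but it cannot say which one is right. Since the first letter is kept as-is, "fenetiklee" and "phonetically" get different codes despite sounding alike.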

Pinckney answered 10/2, 2010 at 12:57 Comment(2)

As George Bernard Shaw (amongst many others) always complained, there is often a great divergence between how things are spelled and how they are pronounced. At least in English. SOUNDEX() might be an effective approach in, say, Italian. – Upright 10/2, 2010 at 13:47

This one is built into the Delphi RTL. It's pretty unpredictable, but fairly cool - good for people who like to write fenetiklee err.. phonetically. – Terrapin 10/2, 2010 at 15:32

There are quite a few Java libraries for natural language processing that would help you implement a spelling corrector. But you asked about a book. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze looks like a good option. The first author is a Stanford professor who leads a natural language processing group that develops Java libraries and NLP resources that many people use.

Kalimantan answered 10/2, 2010 at 13:23 Comment(0)

At DevDays London, Michael Sparks presented a Python script written for exactly that. It was surprisingly simple! See if you can find it on Google. Maybe somebody here will have the link.

Zurheide answered 10/2, 2010 at 12:59 Comment(1)

According to the DevDays thread on MetaSO, the script Michael Sparks presented on was the Peter Norvig script already mentioned: meta.stackexchange.com/questions/27859/… – Upright 10/2, 2010 at 13:15
