Best open-source spell-checker for OCR? [closed]

I have a large number of English OCRed documents from the 19th century and want to clean up some of the OCR errors using a contextual spell-checker such as the one proposed by Peter Norvig at http://norvig.com/spell-correct.html. My main goal is to use a probabilistic model (together with the OCRed text and a large, appropriate dictionary) to correct misspelled words.

I am happy to use the code that Norvig gives on his website and improve it, but before I do so, I would like to ask whether there is an existing open-source solution for this. Norvig himself suggests looking at aspell, but I don't think aspell is a contextual spell-checker, and I'm worried it might not work well for OCR error correction.
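For reference, here is a minimal sketch of the kind of probabilistic corrector described above. It is written in the spirit of Norvig's essay but is not his exact code. The tiny `SAMPLE_CORPUS` and the function names are hypothetical stand-ins; a real run would need word counts from a large corpus of period text. It only considers candidates one edit away and ranks them by word frequency alone, so it does not use surrounding context.

```python
import re
from collections import Counter

# Hypothetical stand-in for a large, hand-corrected 19th-century corpus.
SAMPLE_CORPUS = """
the carriage arrived at the station and the gentleman stepped down
the lady thanked the gentleman and the carriage departed
"""

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Word frequencies act as a crude unigram language model.
COUNTS = Counter(words(SAMPLE_CORPUS))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=COUNTS):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word  # known words are left untouched
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing plausible found; do not guess
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (counts[w], w))


print(correct("tbe"), correct("carriagc"), correct("station"))
```

Because known words are returned unchanged, the quality of the result depends almost entirely on what goes into `COUNTS`.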

Subconscious answered 19/2, 2017 at 23:27 Comment(2)
Make any progress on this? Selah
The best one I've seen is still Peter Norvig's code ... Subconscious

So, you're looking for a spell checker that will substitute the most probable choice whenever it meets a word or phrase it doesn't recognize? That seems like a bad idea for 19c texts unless you have a large corpus of such texts that has already been spell-checked by hand. Words that were commonplace then but are rare now will be replaced without your knowledge. I daresay you may find a contextual spell-checker trained on modern locution to be tetotaciously exflunctified by your 19c phraseology. ☺
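A rough illustration of that risk, using two made-up toy word lists (`MODERN` and `PERIOD` are hypothetical, not real dictionaries): a checker that only knows modern spellings flags legitimate period spellings such as "shew" and "connexion" alongside the genuine OCR error, while a period-aware lexicon flags only the error.

```python
# Hypothetical toy lexicons; real ones would hold many thousands of entries.
MODERN = {"to", "the", "show", "connection"}
PERIOD = MODERN | {"shew", "connexion"}  # adds spellings common in 19c texts


def flagged(tokens, lexicon):
    """Return the tokens a checker using this lexicon would try to 'fix'."""
    return [t for t in tokens if t.lower() not in lexicon]


# "tbe" is a typical OCR misreading of "the"; the other words are valid 19c English.
tokens = "to shew tbe connexion".split()

print(flagged(tokens, MODERN))  # period spellings are flagged along with the error
print(flagged(tokens, PERIOD))  # only the OCR error is flagged
```

Anything in the first list but not the second is a word an automatic corrector would silently rewrite.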

If you have such a corpus, or you're up for creating one, there is a powerful Python-based tool for OCR and analysis called OCRopus. It uses natural language processing, neural networks and many other buzzwords; I think I saw "deep learning" on the to-do list. It does not appear easy to use, though I admit I've never tried it myself. It seems to require skill at the command line and programming in Python. If you're still not daunted, it may be exactly what you're looking for.

On the other hand, if you are looking for something simpler, consider using a program with a standard spell checker. For example, gImageReader can read in your PDF files, OCR them, and let you correct and add the words it doesn't know. I suggest at least trying a simple spell checker before searching for something more complicated.

Screenshot of gImageReader spellchecking the word "?RND(1);"

Plagioclase answered 10/2, 2019 at 8:35 Comment(0)
