Need good OCR for printed source code listing, any ideas?
I sometimes have to take some printed source code and manually type the source code into a text editor.

Obviously typing it up takes a long time, plus extra time to debug the typing errors (oops, missed a "$" sign there).

I decided to try some OCR solutions like:

  • Microsoft Document Imaging (has built-in OCR)
    Result: missed all the leading whitespace, missed all the underscores, and interpreted many of the punctuation characters incorrectly.
    Conclusion: slower than typing the code in manually.
  • Various online web OCR apps
    Result: similar to or worse than Microsoft Document Imaging.
    Conclusion: slower than typing the code in manually.

I feel like source code should be very easy to OCR, given that the font is sans serif and monospaced.

Have any of you found a good OCR solution that works well on source code?

Maybe I just need a better OCR solution (not necessarily source code specific)?

Thanhthank answered 11/12, 2009 at 14:54 Comment(0)

With OCR, there are currently three options:

  • ABBYY FineReader and OmniPage. Both are commercial products which are about on par when it comes to features and OCR results. I can't say much about OmniPage, but FineReader does come with support for reading source code (for example, it has a Java language library).
  • The best OSS OCR engine is Tesseract. It's much harder to use, and you'll probably need to train it for your language.

I rarely do OCR, but I've found that spending the $150 on the commercial software far outweighs the time wasted otherwise.
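
If you do go with Tesseract, the invocation itself is short once the training data is installed. A minimal sketch via the pytesseract wrapper (the file name is a placeholder; --psm 6 and preserve_interword_spaces are standard Tesseract options that help with code listings):

from PIL import Image
import pytesseract

# Grayscale input plus "assume a single uniform block of text" (--psm 6)
# works reasonably for code listings; preserve_interword_spaces keeps
# indentation from being collapsed into single spaces.
text = pytesseract.image_to_string(
    Image.open("listing.png").convert("L"),
    config="--psm 6 -c preserve_interword_spaces=1",
)
print(text)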

Sheeting answered 11/12, 2009 at 15:11 Comment(5)
I tried Tesseract. It failed when I first downloaded it. The online readme specifies that it doesn't come with any training data, so I downloaded the English training data from the website and untarred it into the tessdata subdir. But then it still complained "could not find eng.unicharset". How am I messing this up? – Thanhthank
See what I mean? Tesseract is only free if your time costs nothing. But you can post questions in the Tesseract user group. They are friendly there, and your input will help make it easier for the next person to set this beast up. – Sheeting
@Aaron Digulla, can you share some OCR libraries in the $150 to $500 range? – Yakka
@Sajjad I don't know any. – Sheeting
I would like to point out that without training, Tesseract does nothing different from a regular OCR: it ignores all the leading whitespace and misses all the underscores. However, it is also difficult to train, because you need to spend time labelling each sample. – Faradmeter

Two new options exist today (years after the question was asked):

1.)

Windows 10 comes with an OCR engine from Microsoft.

It is in the namespace:

Windows.Media.Ocr.OcrEngine

https://msdn.microsoft.com/en-us/library/windows/apps/windows.media.ocr

There is also an example on Github:

https://github.com/Microsoft/Windows-universal-samples/tree/master/Samples/OCR

You need VS2015 to compile this stuff. If you want to use an older version of Visual Studio, you must invoke it via traditional COM; see this article on CodeProject: http://www.codeproject.com/Articles/262151/Visual-Cplusplus-and-WinRT-Metro-Some-fundamentals

The OCR quality is very good. Nevertheless, if the text is too small you must enlarge the image first. You can download every language that exists in the world via Windows Update - even for handwriting!
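
The enlargement step can be scripted before you hand the image to the engine. A minimal sketch with Pillow (the file names and the 3x factor are placeholders, not part of the Windows API):

from PIL import Image

img = Image.open("screenshot.png")

# Upscale 3x with a smooth resampling filter; small glyphs recognize
# noticeably better after enlargement.
big = img.resize((img.width * 3, img.height * 3), Image.LANCZOS)
big.save("screenshot_3x.png")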


2.)

Another option is to use the OCR library from Office. It is a COM DLL. It is available in Office 2003, 2007 and Vista, but has been removed in Office 2010.

http://www.codeproject.com/Articles/10130/OCR-with-Microsoft-Office

The disadvantage is that every Office installation comes with support for only a few languages. For example, a Spanish Office installs support for Spanish, English, Portuguese and French. But I noticed that it makes nearly no difference whether you use Spanish or English as the OCR language to detect Spanish text.

If you convert the image to greyscale you get better results. The recognition is OK, but it did not satisfy me. It makes approximately as many errors as Tesseract, although Tesseract needs much more image preprocessing to get those results.
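
For illustration, the MODI component can be driven from a script via COM. A hedged sketch in Python with pywin32, following commonly published MODI examples (the file path is a placeholder, and 9 is the miLANG_ENGLISH language constant):

import win32com.client

doc = win32com.client.Dispatch("MODI.Document")
doc.Create(r"C:\scans\listing.tif")

# OCR(LangId, OCROrientation, OCRStraighten): 9 = miLANG_ENGLISH; the two
# flags enable orientation detection and page straightening.
doc.OCR(9, True, True)
print(doc.Images.Item(0).Layout.Text)
doc.Close(False)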

Egocentric answered 26/7, 2016 at 16:36 Comment(0)

Google Drive's built-in OCR worked pretty well for me. Just convert scans to a PDF, upload to Google Drive, and choose "Open with... Google Docs". There are some weird things with color and text size, but it still includes semicolons and such.

The original screenshot: [original screenshot]
The Google Docs OCR: [Google Docs OCR result]

Plaintext version:

#include <stdio.h> int main(void) { 
char word[51]; int contains = -1; int i = 0; int length = 0; scanf("%s", word); while (word[length] != "\0") i ++; while ((contains == 1 || contains == 2) && word[i] != "\0") { 
if (word[i] == "t" || word[i] == "T") { 
if (i <= length / 2) { 
contains = 1; } else contains = 2; 
return 0; 
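
If you'd rather script this trick than click through the UI, the Drive v3 API can do the same conversion: uploading a PDF or image with a Google Docs target MIME type triggers OCR. A sketch assuming application default credentials are already configured (file names are placeholders):

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

drive = build("drive", "v3")  # assumes application default credentials

# Asking Drive to store the upload as a Google Doc triggers OCR;
# ocrLanguage is a hint for the recognizer.
body = {"name": "listing", "mimeType": "application/vnd.google-apps.document"}
media = MediaFileUpload("listing.pdf", mimetype="application/pdf")
created = drive.files().create(body=body, media_body=media,
                               ocrLanguage="en").execute()

# Export the recognized text back out as plain text.
text = drive.files().export(fileId=created["id"],
                            mimeType="text/plain").execute().decode()
print(text)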
Howl answered 19/5, 2020 at 0:18 Comment(1)
I tried this to scan in a printout from 1994 of a tank clone I built and it worked great! Thanks for the idea. – Gabbard

Try http://www.free-ocr.com/. I have used it to recover source code from a screen grab when my IDE crashes in an editor session without warning. It obviously depends on the font you are using in the editor (I use Courier New 10pt in Delphi). I tried to use Google Docs, which will OCR an image when you upload it - while Google Docs is pretty good on scanned documents, it fails miserably on Pascal source for some reason.

An example of FreeOCR at work. The input image:

[scanned Delphi source listing]

gave this:

begin
FileIDToDelete := FolderToClean + 5earchRecord.Name ;
Inc (TotalFilesFound) ;
if (DeleteFile (PChar (FileIDToDelete))) then
begin
Log5tartupError (FormatEx (‘%s file %s deleted‘, [Annotation, Fi eIDToDelete])) ;
Inc (TotalFilesDeleted) ;
end
else
begin
Log5tartupError (FormatEx (‘Error deleting %s file %s‘, [Annotat'on, FileIDToDelete])) ;
Inc (TotalFilesDeleteErrors) ;
end ;
end ;
FindResult := 5ysUtils.FindNext (5earchRecord) ;
end ;

So restoring the indentation is the bulk of the work, then changing all the 5's to upper-case S. It also got confused by the vertical line at the 80-column mark. Luckily, most errors will be picked up by the compiler (with the exception of mistakes inside quoted strings).
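
The 5-to-S fix-up at least is easy to automate. A small sketch tuned to this particular sample (the regex assumes a stray 5 directly followed by a letter was really an S, which holds here but is not a general rule), which also straightens the curly quotes:

import re

text = open("ocr_output.txt").read()

# 5earchRecord -> SearchRecord, Log5tartupError -> LogStartupError, etc.
text = re.sub(r"5(?=[A-Za-z])", "S", text)

# Normalize curly quotes back to the straight quotes Pascal expects.
text = text.replace("\u2018", "'").replace("\u2019", "'")

open("ocr_fixed.txt", "w").write(text)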

It's a shame FreeOCR doesn't have a "source code" option, where white space is treated as significant.

A tip: If your source includes syntax highlighting, make sure you save the image as grayscale before uploading.

Typebar answered 22/4, 2015 at 21:4 Comment(0)

The trick with fixed-pitch images, such as old-style line printer printouts, is to segment the image into individual characters geometrically, rather than by attempting to identify blobs of ink and work out where they should be within a line.

As far as I know, there is no existing OCR system out there which does this before attempting to recognise the glyphs. Which is why I am writing one :-) So far I have successfully overlaid a grid over a page image which captures each printed character squarely within the grid. It's interesting to note that the grid spacing must be at a sub-pixel level - I'm using a resolution of 0.1 pixels for the grid separation. Obviously you clip to the nearest whole pixel when applying the grid.
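
A minimal sketch of that segmentation idea with Pillow (the grid origin and pitch values here are made-up placeholders; in a real system they would be fitted to the scan, as described above):

from PIL import Image

img = Image.open("printout.png").convert("L")

# Hypothetical grid parameters: sub-pixel pitch matters because rounding
# error would otherwise accumulate across an 80-column line.
x0, y0 = 14.0, 22.0            # top-left corner of the first cell
pitch_x, pitch_y = 12.1, 20.3  # column and row spacing, in 0.1 px steps

cells = []
for row in range(60):
    for col in range(80):
        # Keep the arithmetic in sub-pixel units and clip to whole
        # pixels only when cropping each cell.
        box = (round(x0 + col * pitch_x), round(y0 + row * pitch_y),
               round(x0 + (col + 1) * pitch_x), round(y0 + (row + 1) * pitch_y))
        cells.append(img.crop(box))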

The next phase of my project will be recognising the extracted glyphs. Having them in a bounding box will definitely make that part easier, with respect to accents, differentiating commas from identical-looking apostrophes, and so on. If there's interest from folks here, I'll be happy to open this up to a collaboration once I get the basic framework in place.

By the way, my primary interest in this is recovering historical software from old printouts, mostly from the 60's and 70's. I sketched up the idea for a fixed-pitch OCR back around 1980 but didn't get around to implementing it until last week - being retired, I have a bit more time for fun projects...

[image: Algol W source code listing]

Fredel answered 16/3 at 22:8 Comment(2)
IMO the hardest part will be automatically detecting the grid offset (x/y) and the letter width/height, but that probably wouldn't be too expensive computationally. IMO your approach is a great idea because it solves the issue of not getting the correct number of spaces. – Thanhthank
The grid detection now works (for one test document), and I've just started on the character matching, which, although a rather naive algorithm, seems to be working too, at least for a small demo. If you're interested, I just now put the code online at gtoal.com/src/OCR - not a usable system yet, just my proof-of-concept experiments. – Fredel

Printed text is usually easier for OCR than handwriting; however, it all depends on your source image. I generally find that capturing in PNG format with reduced colors (grayscale is best), plus some manual cleanup (removing any image noise due to scanning, etc.), works best.

Most OCR engines are similar in performance and accuracy. Ones with the ability to train/correct would be best.
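
A sketch of that capture advice with Pillow (the threshold value is a guess that would need tuning per scan):

from PIL import Image, ImageFilter

img = Image.open("scan.png").convert("L")          # reduce to grayscale
img = img.filter(ImageFilter.MedianFilter(3))      # drop isolated speckles
img = img.point(lambda p: 255 if p > 160 else 0)   # binarize; 160 is a guess
img.save("scan_clean.png")                         # PNG keeps glyph edges sharp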

Assignment answered 11/12, 2009 at 14:58 Comment(0)

In general I found that FineReader gives very good results. Normally every product has a trial available - try as many as you can.

Now, program source code can be tricky:

  • leading whitespace: maybe a post-OCR pretty-printing pass can help
  • underscores and punctuation: maybe a good product can be trained for these
Shiism answered 11/12, 2009 at 15:7 Comment(0)

OCRopus is also a good open source option. But like Tesseract, there's a rather steep learning curve to use and integrate it effectively.

Gunlock answered 11/12, 2009 at 15:24 Comment(0)
