Improve OCR accuracy from scanned documents

I'm scanning a lot of A3 documents on a standard Brother A3 multifunction device and then running FineReader Pro to OCR the images.

However, many characters are misrecognized, and the output contains lots of strange non-alphanumeric characters.

Can someone give me tips for programmatically improving the OCR accuracy, either by pre-processing the scanned images or by post-processing the recognized text?


Edit: I have attached a sample PDF. It includes some of the images that give me the poorest results.

Lawn answered 11/1, 2011 at 14:2 Comment(2)
What does the question have to do with programming? - Supermarket
Image processing IS math/programming amzn.to/ef6KR4 - Lawn

Do you have a sample image you can post somewhere? Then we can quickly tell you what is causing most of your problems. FineReader is one of the better OCR engines out there, so there must be specific reasons why you are getting poor results.

It could be related to poor contrast and threshold settings, image skew, dirty rollers in the scanner, complex or coloured backgrounds, dithered backgrounds, font sizes that are too small, a scanning dpi that is too low, etc.

After seeing the attached images, there are a few small issues.

  1. There are lots of dirty specks on the page background. FineReader seems to do a reasonable job with these on your images.
  2. There is some slight skew, but that is not causing any problems.
  3. FineReader is getting confused by the bold, tall Arial-type font used for the column headers.
  4. A big problem seems to be the bottom region of the pages, where the contrast is poor and the image is fuzzy. This seems to be a problem with the scanner but could be due to printing problems.
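
If you do want to try removing the background specks in software before OCR, a small median filter is one common option. The following is only a rough sketch, assuming the page has already been loaded as a grayscale image stored as rows of 0-255 integers; the function name `despeckle` and the `radius` parameter are made up for illustration and have nothing to do with FineReader itself.

```python
from statistics import median


def despeckle(gray, radius=1):
    """Median-filter a grayscale image given as rows of 0-255 ints.

    Each output pixel is the median of its (2*radius+1)-square
    neighbourhood, clipped at the image borders. Isolated specks
    smaller than the window vanish; larger shapes survive.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighbourhood, staying inside the image.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


# Tiny synthetic page: white background, a 3-pixel-wide black stroke,
# and a single-pixel black speck.
page = [[255] * 9 for _ in range(9)]
for y in range(9):
    for x in (5, 6, 7):
        page[y][x] = 0
page[4][1] = 0  # the speck

cleaned = despeckle(page)
```

Be careful with this on small print: a median filter will also eat thin strokes, dots and punctuation, so compare OCR results with and without it.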

The printing is quite poor and I am guessing it is a scan from a newspaper. Most of your errors are due to scanning issues so it would be hard to programmatically improve the results.

Firstly, I would try scanning the image in grayscale using a slightly higher resolution and see if that helps. FineReader works well with grayscale images. If you have to have a B/W image then see if the scanner driver includes a setting for dynamic thresholding and turn it on.
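
If the scanner driver has no dynamic thresholding option, you can approximate the idea in software on the grayscale scan: compare each pixel with the mean of its local neighbourhood instead of one global cut-off, so unevenly lit regions are judged against their own background. This is a minimal sketch of local-mean thresholding, not what any scanner driver or FineReader actually does; `adaptive_threshold`, `radius` and `offset` are names and defaults chosen here for illustration, and the image is assumed to be rows of 0-255 integers.

```python
def adaptive_threshold(gray, radius=7, offset=10):
    """Binarize a grayscale image (rows of 0-255 ints) with a local mean.

    A pixel becomes black (0) if it is darker than the mean of its
    (2*radius+1)-square neighbourhood minus `offset`, else white (255).
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border:
    # integral[y][x] holds the sum of gray[0..y-1][0..x-1].
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    out = []
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            # Sum of the window in O(1) using the integral image.
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(0 if gray[y][x] < mean - offset else 255)
        out.append(row)
    return out


# Synthetic page whose background gets darker towards the bottom,
# with one "ink" pixel near the top and one near the bottom.
page = [[200 - 2 * y] * 20 for y in range(20)]
page[2][10] -= 60
page[18][10] -= 60

binary = adaptive_threshold(page)
```

The window size and offset need tuning per document; the window should be larger than the stroke width of the text, otherwise the inside of thick characters starts to look like background.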

Your images would not be an easy task for any OCR engine. You will get better results if you can improve the scanning. Page 3 has a lot of noise in the bottom right corner.

What version of FineReader are you using? FR10 would probably give better results than previous versions.

Exerciser answered 12/1, 2011 at 1:58 Comment(1)
Thanks for the help! I am going to follow your suggestions and compare the results. Yes, I do use FR10. - Lawn
