Image processing to improve Tesseract OCR accuracy
I've been using tesseract to convert documents into text. The quality of the documents ranges wildly, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that text that is highly pixellated - for example that generated by fax machines - is especially difficult for tesseract to process - presumably all those jagged edges to the characters confound the shape-recognition algorithms.

What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixellated images and seen some small improvement, but I'm hoping that there is a more specific technique that would yield better results. Say a filter that was tuned to black and white images, which would smooth out irregular edges, followed by a filter which would increase the contrast to make the characters more distinct.

Any general tips for someone who is a novice at image processing?

Bordelon answered 28/2, 2012 at 10:12 Comment(0)
146
  1. Fix DPI (if needed): 300 DPI is the minimum.
  2. Fix text size: e.g. 12 pt should be OK for Tesseract 3.x (a.k.a. the legacy engine). New: the best accuracy with Tesseract >= 4.x (LSTM engine) comes with a capital-letter height of 30-33 pixels.
  3. Try to fix text lines (deskew and dewarp the text).
  4. Try to fix the illumination of the image (e.g. no dark parts of the image).
  5. Binarize and de-noise the image.

There is no universal command line that would fit all cases (sometimes you need to blur and sharpen the image). But you can give a try to TEXTCLEANER from Fred's ImageMagick Scripts.

If you are not a fan of the command line, you can try the open-source scantailor.sourceforge.net or the commercial bookrestorer.

Faludi answered 5/4, 2012 at 18:46 Comment(14)
And there's an illustrated guide on how to do this: code.google.com/p/tesseract-ocr/wiki/ImproveQualityDiscrete
Note, the linked script appears to be Linux-only.Aeolic
@ZoranPavlovic you are correct. The link is for Linux only.Courtenay
This is not true - it is a bash script. If you have bash and ImageMagick installed, it will run on Windows too. Bash can be installed as part of other useful software, e.g. git or msys2...Faludi
Hi, Can you help me out on this #32473595Ladykiller
@Discrete It has since moved to GitHub. The wiki page is at: github.com/tesseract-ocr/tesseract/wiki/ImproveQualityKasher
I was using a different kind of image processing with OpenCV: pretty much a Gaussian blur (11x11) followed by a binary + Otsu threshold. Then I changed the DPI to 200 (even though you recommended a minimum of 300) and that became the finishing touch that fixed my problem. My old DPI was set to 75.Lim
@Lim Can you tell how you increased DPI to 200? I want to do it through a Python code or may be through command line.Orvah
@Faludi Any ideas how to change the text size, let us say from 10 to 12 pt, through Python?Orvah
@SKR: the first 2 pieces of advice are about controlling the quality of the input. If your scan results in 4 pt letters, there is no technique to improve it. Resizing 10 or 11 pt letters to 12 may not have a big impact, but you can try. Just search for "python resize"...Faludi
scantailor is in the Debian/Ubuntu software repository and has a command-line interface. scantailor-cli --dpi=200 --output-dpi=400 input.png outfolder worked well for me.Dissert
The Tesseract docs moved again, to tesseract-ocr.github.io/tessdoc/ImproveQualityProcumbent
Wikipedia says: "Tesseract's output will have very poor quality if the input images are not preprocessed to suit it: Images (especially screenshots) must be scaled up such that the text x-height is at least 20 pixels,[13] any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters." I scaled the image using Gimp and it was accurately read.Aport
Why do you prefer information from an unknown bystander (Wikipedia) instead of the real source (the tesseract documentation)?Faludi
91

I am by no means an OCR expert, but this week I needed to convert text out of a JPG.

I started with a colorized, RGB, 445x747 pixel JPG. I immediately tried tesseract on it, but the program converted almost nothing. I then went into GIMP and did the following.

  • image > mode > grayscale
  • image > scale image > 1191x2000 pixels
  • filters > enhance > unsharp mask with values of
    radius = 6.8, amount = 2.69, threshold = 0

I then saved as a new jpg at 100% quality.

Tesseract was then able to extract all the text into a .txt file.

Gimp is your friend.
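
The same three GIMP steps can also be scripted, e.g. with Pillow (a sketch; `gimp_style_cleanup` is a hypothetical name, the 3x factor only roughly matches the 445x747 -> 1191x2000 scaling above, and Pillow expresses the unsharp "amount" as a percent, so 2.69 becomes 269):

```python
from PIL import Image, ImageFilter, ImageOps

def gimp_style_cleanup(img):
    """Grayscale, upscale, then unsharp mask - mirroring the GIMP steps."""
    g = ImageOps.grayscale(img)                               # image > mode > grayscale
    g = g.resize((g.width * 3, g.height * 3), Image.LANCZOS)  # image > scale image
    return g.filter(ImageFilter.UnsharpMask(radius=6.8, percent=269, threshold=0))
```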

Poorhouse answered 24/7, 2012 at 18:45 Comment(5)
+1 I followed your steps and I got a great improvement. ThanksLickerish
I also have the impression that Tesseract works better if you convert the input to a TIFF file and give Tesseract the TIFF (rather than asking Tesseract to do the conversion for you). ImageMagick can do the conversion for you. This is my anecdotal impression, but I haven't tested it carefully, so it could be wrong.Phemia
+1 The "unsharp mask" filter really made my day. Another step that helped me: using the "fuzzy selection" tool, select the background, then press Del to whiten it out.Tisbe
I am stuck on this image processing issue before tesseract recognition #32473595 Can you help me out here?Ladykiller
Is there a way to automatically perform these steps in Python?Lorettelorgnette
65

As a rule of thumb, I usually apply the following image pre-processing techniques using OpenCV library:

  1. Rescaling the image (recommended if you're working with images that have a DPI of less than 300):

    img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
    
  2. Converting image to grayscale:

    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
  3. Applying dilation and erosion to remove the noise (you may play with the kernel size depending on your data set):

    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    
  4. Applying blur and thresholding, which can be done using one of the following lines (each has its pros and cons; however, median blur and bilateral filter usually perform better than Gaussian blur). Note that each call returns a new image, so assign the result back:

    img = cv2.threshold(cv2.GaussianBlur(img, (5, 5), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    img = cv2.threshold(cv2.bilateralFilter(img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    img = cv2.adaptiveThreshold(cv2.GaussianBlur(img, (5, 5), 0), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    img = cv2.adaptiveThreshold(cv2.bilateralFilter(img, 9, 75, 75), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    img = cv2.adaptiveThreshold(cv2.medianBlur(img, 3), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

I've recently written a pretty simple guide to Tesseract; it should enable you to write your first OCR script and clear up some hurdles that I experienced when things were less clear than I would have liked in the documentation.

In case you'd like to check them out, here I'm sharing the links with you:

Bergquist answered 8/6, 2018 at 14:15 Comment(3)
Why do we convert the image to grayscale? To be more specific, I have seen that in the image detection process the image is first converted to grayscale, then Sobel -> MSER -> SWT. Could you please elaborate? I am new to the IP field.Va
As to my understanding, it depends on the algorithm; some may not need to convert at all. Think of pixels as a few color values stored digitally - in the case of RGB: red, green, and blue. When a pixel is converted to grayscale, your algorithm needs to work on only one channel instead of 3. This comes with obvious speed advantages when running your algorithm on pixels one by one. Further, some may also say that it's easier to remove noise and detect edges in a picture when it is converted to grayscale.Bergquist
Thank you for the response. And about your blog, could you please write one on HOW TO BUILD OCR FROM SCRATCH USING TESSERACT for non-Roman scripts? I have been searching everywhere, and all that is available right now isn't clear.Va
33

Three points to improve the readability of the image:

  1. Resize the image with variable height and width (multiply the image height and width by 0.5, 1, and 2).

  2. Convert the image to grayscale format (black and white).

  3. Remove the noise pixels and make it clearer (filter the image).

Refer to the code below:

Resize

public Bitmap Resize(Bitmap bmp, int newWidth, int newHeight)
{
    Bitmap temp = (Bitmap)bmp;
    Bitmap bmap = new Bitmap(newWidth, newHeight, temp.PixelFormat);

    double nWidthFactor = (double)temp.Width / (double)newWidth;
    double nHeightFactor = (double)temp.Height / (double)newHeight;

    double fx, fy, nx, ny;
    int cx, cy, fr_x, fr_y;
    Color color1, color2, color3, color4;
    byte nRed, nGreen, nBlue;
    byte bp1, bp2;

    for (int x = 0; x < bmap.Width; ++x)
    {
        for (int y = 0; y < bmap.Height; ++y)
        {
            // Bilinear interpolation: blend the four source pixels
            // surrounding the back-projected coordinate.
            fr_x = (int)Math.Floor(x * nWidthFactor);
            fr_y = (int)Math.Floor(y * nHeightFactor);
            cx = fr_x + 1;
            if (cx >= temp.Width) cx = fr_x;
            cy = fr_y + 1;
            if (cy >= temp.Height) cy = fr_y;
            fx = x * nWidthFactor - fr_x;
            fy = y * nHeightFactor - fr_y;
            nx = 1.0 - fx;
            ny = 1.0 - fy;

            color1 = temp.GetPixel(fr_x, fr_y);
            color2 = temp.GetPixel(cx, fr_y);
            color3 = temp.GetPixel(fr_x, cy);
            color4 = temp.GetPixel(cx, cy);

            // Blue
            bp1 = (byte)(nx * color1.B + fx * color2.B);
            bp2 = (byte)(nx * color3.B + fx * color4.B);
            nBlue = (byte)(ny * (double)bp1 + fy * (double)bp2);

            // Green
            bp1 = (byte)(nx * color1.G + fx * color2.G);
            bp2 = (byte)(nx * color3.G + fx * color4.G);
            nGreen = (byte)(ny * (double)bp1 + fy * (double)bp2);

            // Red
            bp1 = (byte)(nx * color1.R + fx * color2.R);
            bp2 = (byte)(nx * color3.R + fx * color4.R);
            nRed = (byte)(ny * (double)bp1 + fy * (double)bp2);

            bmap.SetPixel(x, y, Color.FromArgb(255, nRed, nGreen, nBlue));
        }
    }

    bmap = SetGrayscale(bmap);
    bmap = RemoveNoise(bmap);

    return bmap;
}

SetGrayscale

public Bitmap SetGrayscale(Bitmap img)
{
    Bitmap temp = (Bitmap)img;
    Bitmap bmap = (Bitmap)temp.Clone();
    Color c;
    for (int i = 0; i < bmap.Width; i++)
    {
        for (int j = 0; j < bmap.Height; j++)
        {
            c = bmap.GetPixel(i, j);
            // Standard luminance weights for the RGB channels
            byte gray = (byte)(.299 * c.R + .587 * c.G + .114 * c.B);

            bmap.SetPixel(i, j, Color.FromArgb(gray, gray, gray));
        }
    }
    return (Bitmap)bmap.Clone();
}

RemoveNoise

public Bitmap RemoveNoise(Bitmap bmap)
{
    // Push every pixel to pure black or pure white around a fixed threshold
    for (var x = 0; x < bmap.Width; x++)
    {
        for (var y = 0; y < bmap.Height; y++)
        {
            var pixel = bmap.GetPixel(x, y);
            if (pixel.R < 162 && pixel.G < 162 && pixel.B < 162)
                bmap.SetPixel(x, y, Color.Black);
            else if (pixel.R > 162 && pixel.G > 162 && pixel.B > 162)
                bmap.SetPixel(x, y, Color.White);
        }
    }

    return bmap;
}

INPUT IMAGE

OUTPUT IMAGE

Commines answered 10/12, 2014 at 9:31 Comment(3)
Yes, we have to pass the required parameters to the Resize method. It will process the resize, SetGrayscale and RemoveNoise operations, then return the output image with better readability.Commines
Tried this approach on a set of files and compared with the initial results. In some limited cases it gives a better result; mostly there was a slight decrease in output text quality. So it does not look like a universal solution.Gloriole
This actually worked out pretty well for me. Certainly it gives a starting point for image pre-processing that reduces the amount of gibberish you get back from Tesseract.Bicentenary
20

What was EXTREMELY HELPFUL to me along the way are the source codes of the Capture2Text project: http://sourceforge.net/projects/capture2text/files/Capture2Text/.

BTW: kudos to its author for sharing such a painstaking algorithm.

Pay special attention to the file Capture2Text\SourceCode\leptonica_util\leptonica_util.c - that's the essence of image preprocessing for this utility.

If you run the binaries, you can check the image transformation before/after the process in the Capture2Text\Output\ folder.

P.S. The mentioned solution uses Tesseract for OCR and Leptonica for preprocessing.

Amylum answered 21/3, 2014 at 10:39 Comment(1)
Thank you for the Capture2Text tool. It perfectly solves all the OCR issues in my project!Laggard
18

This was some time ago, but it still might be useful.

My experience shows that resizing the image in memory before passing it to Tesseract sometimes helps.

Try different modes of interpolation. The post https://mcmap.net/q/129027/-how-to-resize-the-buffered-image-n-graphics-2d-in-java helped me a lot.

Sagacity answered 27/5, 2013 at 20:43 Comment(0)
14

The Tesseract documentation contains some good details on how to improve the OCR quality via image processing steps.

To some degree, Tesseract automatically applies them. It is also possible to tell Tesseract to write an intermediate image for inspection, i.e. to check how well the internal image processing works (search for tessedit_write_images in the above reference).

More importantly, the new neural network system in Tesseract 4 yields much better OCR results - in general and especially for images with some noise. It is enabled with --oem 1, e.g. as in:

$ tesseract --oem 1 -l deu page.png result pdf

(this example selects the German language)

Thus, it makes sense to first test how far you get with the new Tesseract LSTM mode before applying custom image pre-processing steps.
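
From Python, the same engine selection and the intermediate-image dump can be passed as a config string, e.g. via the pytesseract wrapper (a sketch; `tess_config` is a hypothetical helper):

```python
def tess_config(oem=1, write_images=False):
    """Build a Tesseract config string: --oem 1 selects the LSTM engine,
    and tessedit_write_images makes Tesseract write its internally
    pre-processed image for inspection."""
    cfg = f"--oem {oem}"
    if write_images:
        cfg += " -c tessedit_write_images=true"
    return cfg

# e.g.: pytesseract.image_to_string(img, lang="deu", config=tess_config(write_images=True))
```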

Misrepresent answered 15/10, 2017 at 20:12 Comment(0)
13

Java version for Sathyaraj's code above:

// Resize
public Bitmap resize(Bitmap img, int newWidth, int newHeight) {
    Bitmap bmap = img.copy(img.getConfig(), true);

    double nWidthFactor = (double) img.getWidth() / (double) newWidth;
    double nHeightFactor = (double) img.getHeight() / (double) newHeight;

    double fx, fy, nx, ny;
    int cx, cy, fr_x, fr_y;
    int color1;
    int color2;
    int color3;
    int color4;
    byte nRed, nGreen, nBlue;

    byte bp1, bp2;

    for (int x = 0; x < bmap.getWidth(); ++x) {
        for (int y = 0; y < bmap.getHeight(); ++y) {

            fr_x = (int) Math.floor(x * nWidthFactor);
            fr_y = (int) Math.floor(y * nHeightFactor);
            cx = fr_x + 1;
            if (cx >= img.getWidth())
                cx = fr_x;
            cy = fr_y + 1;
            if (cy >= img.getHeight())
                cy = fr_y;
            fx = x * nWidthFactor - fr_x;
            fy = y * nHeightFactor - fr_y;
            nx = 1.0 - fx;
            ny = 1.0 - fy;

            color1 = img.getPixel(fr_x, fr_y);
            color2 = img.getPixel(cx, fr_y);
            color3 = img.getPixel(fr_x, cy);
            color4 = img.getPixel(cx, cy);

            // Blue
            bp1 = (byte) (nx * Color.blue(color1) + fx * Color.blue(color2));
            bp2 = (byte) (nx * Color.blue(color3) + fx * Color.blue(color4));
            nBlue = (byte) (ny * (double) (bp1) + fy * (double) (bp2));

            // Green
            bp1 = (byte) (nx * Color.green(color1) + fx * Color.green(color2));
            bp2 = (byte) (nx * Color.green(color3) + fx * Color.green(color4));
            nGreen = (byte) (ny * (double) (bp1) + fy * (double) (bp2));

            // Red
            bp1 = (byte) (nx * Color.red(color1) + fx * Color.red(color2));
            bp2 = (byte) (nx * Color.red(color3) + fx * Color.red(color4));
            nRed = (byte) (ny * (double) (bp1) + fy * (double) (bp2));

            // mask with 0xFF: Java bytes are signed, so values > 127 would
            // otherwise widen to negative arguments
            bmap.setPixel(x, y, Color.argb(255, nRed & 0xFF, nGreen & 0xFF, nBlue & 0xFF));
        }
    }

    bmap = setGrayscale(bmap);
    bmap = removeNoise(bmap);

    return bmap;
}

// SetGrayscale
private Bitmap setGrayscale(Bitmap img) {
    Bitmap bmap = img.copy(img.getConfig(), true);
    int c;
    for (int i = 0; i < bmap.getWidth(); i++) {
        for (int j = 0; j < bmap.getHeight(); j++) {
            c = bmap.getPixel(i, j);
            // use int, not byte: Java bytes are signed, so values > 127
            // would widen to negative arguments in Color.argb
            int gray = (int) (.299 * Color.red(c) + .587 * Color.green(c)
                    + .114 * Color.blue(c));

            bmap.setPixel(i, j, Color.argb(255, gray, gray, gray));
        }
    }
    return bmap;
}

// RemoveNoise
private Bitmap removeNoise(Bitmap bmap) {
    for (int x = 0; x < bmap.getWidth(); x++) {
        for (int y = 0; y < bmap.getHeight(); y++) {
            int pixel = bmap.getPixel(x, y);
            if (Color.red(pixel) < 162 && Color.green(pixel) < 162 && Color.blue(pixel) < 162) {
                bmap.setPixel(x, y, Color.BLACK);
            }
        }
    }
    for (int x = 0; x < bmap.getWidth(); x++) {
        for (int y = 0; y < bmap.getHeight(); y++) {
            int pixel = bmap.getPixel(x, y);
            if (Color.red(pixel) > 162 && Color.green(pixel) > 162 && Color.blue(pixel) > 162) {
                bmap.setPixel(x, y, Color.WHITE);
            }
        }
    }
    return bmap;
}
Beefeater answered 20/7, 2016 at 1:37 Comment(2)
What is your class for Bitmap? Bitmap is not found in Java (it's in Android natively).Halona
This method throws an exception: Caused by: java.lang.IllegalArgumentException: y must be < bitmap.height()Backboard
7

Adaptive thresholding is important if the lighting is uneven across the image. My preprocessing using GraphicsMagick is mentioned in this post: https://groups.google.com/forum/#!topic/tesseract-ocr/jONGSChLRv4

GraphicsMagick also has the -lat feature for Linear-time Adaptive Threshold, which I will try soon.

Another method of thresholding using OpenCV is described here: https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html

Celt answered 16/1, 2015 at 21:35 Comment(0)
2

I did the following to get good results out of images whose text is not very small:

  1. Apply blur to the original image.
  2. Apply an adaptive threshold.
  3. Apply a sharpening effect.

And if you're still not getting good results, scale the image to 150% or 200%.

Tillietillinger answered 11/10, 2017 at 20:14 Comment(0)
2

Reading text from image documents using any OCR engine has many issues when it comes to getting good accuracy. There is no fixed solution for all cases, but here are a few things to consider to improve OCR results.

1) Presence of noise due to poor image quality / unwanted elements/blobs in the background region. This requires some pre-processing operations like noise removal, which can easily be done using a Gaussian filter or normal median filter methods. These are also available in OpenCV.

2) Wrong orientation of the image: because of wrong orientation, the OCR engine fails to segment the lines and words in the image correctly, which gives the worst accuracy.

3) Presence of lines: while doing word or line segmentation, the OCR engine sometimes also tries to merge words and lines together, thus processing wrong content and giving wrong results. There are other issues too, but these are the basic ones.

This post on an OCR application is an example case where some image pre-processing and post-processing of the OCR result can be applied to get better OCR accuracy.

Bernadettebernadina answered 23/10, 2017 at 10:5 Comment(0)
2

Text recognition depends on a variety of factors to produce good-quality output. OCR output highly depends on the quality of the input image. This is why every OCR engine provides guidelines regarding the quality and size of the input image. These guidelines help the OCR engine produce accurate results.

I have written a detailed article on image processing in Python. Kindly follow the link below for more explanation. I have also added the Python source code implementing those processes.

Please write a comment if you have a suggestion or better idea on this topic to improve it.

https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033

Eddie answered 18/9, 2018 at 6:43 Comment(1)
Please add an answer here as a summary of your blog. So that even if the link is dead the answer wont be rendered useless.Fedora
1

So far, I've played a lot with Tesseract 3.x, 4.x and 5.0.0. Tesseract 4.x and 5.x seem to yield the exact same accuracy.

Sometimes I get better results with the legacy engine (using --oem 0) and sometimes better results with the LSTM engine (--oem 1). Generally speaking, I get the best results on upscaled images with the LSTM engine. The latter is on par with my earlier engine (ABBYY CLI OCR 11 for Linux).

Of course, the traineddata needs to be downloaded from GitHub, since most Linux distros only provide the fast versions. The trained data that works for both the legacy and LSTM engines can be downloaded from https://github.com/tesseract-ocr/tessdata with commands like the following. Don't forget to download the OSD trained data too.

curl -L https://github.com/tesseract-ocr/tessdata/blob/main/eng.traineddata?raw=true -o /usr/share/tesseract/tessdata/eng.traineddata
curl -L https://github.com/tesseract-ocr/tessdata/blob/main/osd.traineddata?raw=true -o /usr/share/tesseract/tessdata/osd.traineddata

I've ended up using ImageMagick as my image preprocessor since it's convenient and can easily be run from scripts. You can install it with yum install ImageMagick or apt install imagemagick, depending on your distro flavor.

So here's my oneliner preprocessor that fits most of the stuff I feed to my OCR:

convert my_document.jpg -units PixelsPerInch -respect-parenthesis \( -compress LZW -resample 300 -bordercolor black -border 1 -trim +repage -fill white -draw "color 0,0 floodfill" -alpha off -shave 1x1 \) \( -bordercolor black -border 2 -fill white -draw "color 0,0 floodfill" -alpha off -shave 0x1 -deskew 40 +repage \) -antialias -sharpen 0x3 preprocessed_my_document.tiff

Basically we:

  • use the TIFF format since tesseract likes it more than JPG (decompressor related, who knows)
  • use lossless LZW TIFF compression
  • resample the image to 300 dpi
  • use some black magic to remove unwanted colors
  • try to rotate the page if rotation can be detected
  • antialias the image
  • sharpen the text

The latter image can then be fed to tesseract with:

tesseract -l eng preprocessed_my_document.tiff - --oem 1 --psm 1

Btw, some years ago I wrote the 'poor man's OCR server', which checks for changed files in a given directory and launches OCR operations on all files that are not already OCRed. pmocr is compatible with tesseract 3.x-5.x and abbyyocr11. See the pmocr project on GitHub.

Juanitajuanne answered 29/12, 2021 at 19:28 Comment(0)
1

For scanned documents/images, I find the tools in ScanTailor extremely effective. Some of the useful processes to improve OCR accuracy are:

  • Despeckling: removing noise (like black dots) within or around the text.
  • Deskewing: so that the text will correctly align along straight lines.
  • Cropping margins: so that unnecessary text/images around the margins won't confuse the OCR engine.
  • Content selection: if you want to OCR just part of the image, you can tell it to remove some parts and keep the wanted parts.
  • Splitting multi-column pages: sometimes it is useful to split multi-column material into separate pages to increase the accuracy of the OCR. ScanTailor does that beautifully, and automatically.
Brillatsavarin answered 15/9, 2023 at 9:34 Comment(0)
0

You can do noise reduction and then apply thresholding, but you can also play around with the configuration of the OCR by changing the --psm and --oem values.

Try: --psm 5 --oem 2

You can also look at the following link for further details: here

Mcneal answered 4/7, 2020 at 5:22 Comment(0)
