How to improve tesseract.js accuracy?

I'm using this piece of code from the website, but it's not accurate enough:

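  // Setup assumed from the tesseract.js v2-style API; the original snippet
  // used `scheduler` without showing where it was created.
  const { createWorker, createScheduler } = require("tesseract.js");

  const scheduler = createScheduler();
  const worker1 = createWorker();
  const worker2 = createWorker();

  await worker1.load();
  await worker2.load();
  await worker1.loadLanguage("eng");
  await worker2.loadLanguage("eng");
  await worker1.initialize("eng");
  await worker2.initialize("eng");

  // The scheduler hands each job to whichever worker is idle.
  scheduler.addWorker(worker1);
  scheduler.addWorker(worker2);

  /** Add a single recognition job */
  const {
    data: { text }
  } = await scheduler.addJob("recognize", image);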

This is the type of image whose text I'm trying to read:

[image]

Though it seems simple and easy, Tesseract sometimes fails to read it. Are there any better alternatives to tesseract.js, or any way to improve the accuracy?

Octillion answered 1/12, 2019 at 13:51 Comment(8)
Have you tried applying some filtering to the input images, for example to enhance the contrast or enlarge them? I think one way to get better accuracy is to modify the input images. - Amarelle
Actually, I have applied some filters and removed some noise to make the image clearer, and performance improved, but it is still sometimes unable to read the text. I don't know why. - Octillion
Do you suggest any specific modifications? - Octillion
You can start with this post: docparser.com/blog/improve-ocr-accuracy Increasing contrast, image sharpening, and noise removal are some basic image enhancements that might help you get more accurate results. - Amarelle
Additionally, you might want to check threshold filtering. See this code for example: github.com/laurenzcodes/Canvas-Threshold-Effect - Amarelle
You can also dive deeper into edge detection algorithms, like the Sobel or Canny algorithms. - Amarelle
I used a negative version of your image and it works fine. Additional gamma correction also looks promising. - Jiggle
I am facing accuracy issues as well when piping in an HTML canvas with very basic black strokes on a white background. I am getting wildly inconsistent results even when just attempting to detect numbers :/ - Roe
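
The negative and gamma-correction ideas from the comments above can be sketched in plain JavaScript. This is only an illustration under stated assumptions: the helper names `invertGray` and `gammaCorrect` are hypothetical, and the input is assumed to be an array of 8-bit grayscale pixel values (0 = black, 255 = white) that you have already extracted from the image with a library of your choice.

```javascript
// Hypothetical helpers (not part of tesseract.js) operating on
// 8-bit grayscale pixel values: 0 = black, 255 = white.

// Negative: light text on a dark background becomes dark text on a light one.
function invertGray(pixels) {
  return Uint8Array.from(pixels, (v) => 255 - v);
}

// Gamma correction: out = 255 * (in / 255) ^ (1 / gamma).
// gamma > 1 brightens midtones, gamma < 1 darkens them, gamma = 1 is a no-op.
function gammaCorrect(pixels, gamma) {
  // Precompute a 256-entry lookup table so each pixel is a single array read.
  const lut = new Uint8Array(256);
  for (let v = 0; v < 256; v++) {
    lut[v] = Math.round(255 * Math.pow(v / 255, 1 / gamma));
  }
  return Uint8Array.from(pixels, (v) => lut[v]);
}

// Usage on a tiny made-up sample.
const sample = Uint8Array.from([0, 64, 128, 255]);
console.log(invertGray(sample));
console.log(gammaCorrect(sample, 2.0));
```

Which gamma value helps, if any, depends on the image, so it is worth trying a few values and comparing the OCR output.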

When applying OCR with Tesseract, it is important to preprocess the image so that the text to detect is black and the background is white. To do this, you can apply a simple threshold to obtain a binary image. Here's the image after preprocessing:

[image]

Result from Tesseract

52024

I implemented this approach in Python with OpenCV, but you can adapt a similar strategy to JavaScript!

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image as grayscale and apply an inverted Otsu threshold to get a binary image
image = cv2.imread('1.png', 0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Perform OCR
data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.waitKey()
Linton answered 3/12, 2019 at 2:15 Comment(5)
Thanks for the answer. Do you know of any Node.js library to achieve that? - Octillion
Using Jimp, I inverted the colors and the accuracy really improved; I think it's enough for my current project. I still need a good library to do that in Node.js, though. Anyway, thanks for your answer. - Octillion
Unfortunately, I'm not too familiar with Node.js, but once you find one you can follow the same approach. Good luck! - Linton
Thanks for the hint regarding Jimp; I'm not sure why it shouldn't be possible to port it, but I found something that looks similar and runs on Node.js: Nimp - Vidovic
I can recommend the sharp npm library; it has all these features built in. - Zygosis
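
For readers who want the thresholding step from the answer in Node.js without OpenCV, here is a minimal dependency-free sketch of an inverted Otsu threshold. It is not the answerer's code: the function names `otsuThreshold` and `binarizeInverted` are hypothetical, and the input is assumed to be an array of 8-bit grayscale pixel values already decoded from the image.

```javascript
// Otsu's method: choose the threshold that maximizes the between-class
// variance of the two pixel groups (values <= t and values > t).
function otsuThreshold(pixels) {
  const hist = new Array(256).fill(0);
  for (const v of pixels) hist[v]++;

  const total = pixels.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let sumBg = 0;
  let weightBg = 0;
  let bestVariance = -1;
  let bestThreshold = 0;
  for (let t = 0; t < 256; t++) {
    weightBg += hist[t];
    if (weightBg === 0) continue;      // no pixels at or below t yet
    const weightFg = total - weightBg;
    if (weightFg === 0) break;         // every pixel is at or below t
    sumBg += t * hist[t];
    const meanBg = sumBg / weightBg;
    const meanFg = (sumAll - sumBg) / weightFg;
    const between = weightBg * weightFg * (meanBg - meanFg) ** 2;
    if (between > bestVariance) {
      bestVariance = between;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}

// Inverted binarization, analogous to THRESH_BINARY_INV: pixels brighter
// than the threshold become black (0), all others become white (255).
function binarizeInverted(pixels) {
  const t = otsuThreshold(pixels);
  return Uint8Array.from(pixels, (v) => (v > t ? 0 : 255));
}

// Usage on a made-up sample: dark background values and bright text values.
const sample = [10, 12, 11, 240, 245, 250];
console.log(otsuThreshold(sample));
console.log(binarizeInverted(sample));
```

The binarized buffer would still need to be encoded back into an image (for example with Jimp or sharp, as suggested in the comments) before being passed to tesseract.js.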
