Tesseract - unable to recognize Greek letters at all

I am trying to automatically extract a scale (scale bar + a number + unit) from an image. Here is an example:

enter image description here

It is used to map pixels to real world measurement.

I am using PyTesseract (installed through Anaconda3).

Here is my code:

import cv2
import pytesseract
import numpy as np

img = cv2.imread('pbmk_scale.tif')
#img = cv2.imread('ocr_test_greek_and_english.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Line detection for the scale line
edges = cv2.Canny(gray,50,150,apertureSize = 3)
minLineLength = 100
maxLineGap = 10
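# Pass minLineLength/maxLineGap by keyword; positionally, the fifth argument is the optional output array
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100, minLineLength=minLineLength, maxLineGap=maxLineGap)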
x1,y1,x2,y2 = lines[0][0]
print('Line (' + str(x1) + ',' + str(y1) + ') -- (' + str(x2) + ',' + str(y2) + ')')
# Calculate the length of the scale line in pixels. Since the line is always horizontal, we only need to subtract the X coordinates
l = abs(x1 - x2)
print('Line is ' + str(l) + 'px long')

# Text recognition for the scale number and real unit
# FIXME Greek not detected. Is it grc or ell for the configuration? Both don't work
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))
text = pytesseract.image_to_string(img, config=custom_config)
print('OUTPUT:', text.split())
number = [int(s) for s in text.split() if s.isdigit()]
print('Number is ' + str(number))

So far it is working quite nicely especially since the image is generated through Helium Ion microscopy and the label (where the scale bar is positioned) is automatically generated and stored along with the image as TIFF. So detecting text and lines is spot on. In addition the scale bar is always in the same location in the image and the actual scale line is always horizontal. The code above has its flaws but I'm more interested in the fact that I am unable to detect anything but English.
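For context, once the number and unit are read correctly, the pixel-to-real-world mapping is a single division. Below is a minimal sketch, assuming the label looks like "200 nm" or "5 µm"; the function name `metres_per_pixel` and the unit table are my own illustrative choices, not part of the script above.

```python
import re

# Conversion factors to metres. The unit list is an assumption for illustration;
# both the micro sign (µ) and the Greek letter mu (μ) are accepted.
UNIT_TO_METRES = {"nm": 1e-9, "µm": 1e-6, "μm": 1e-6, "um": 1e-6, "mm": 1e-3, "m": 1.0}

def metres_per_pixel(label, line_length_px):
    """Parse a label such as '200 nm' and return metres per pixel, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(nm|µm|μm|um|mm|m)\b", label)
    if match is None or line_length_px <= 0:
        return None
    value = float(match.group(1))
    return value * UNIT_TO_METRES[match.group(2)] / line_length_px
```

With the pixel length `l` from the line detection, this would be called as `metres_per_pixel(text, l)`.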

Sadly, Anaconda is quite cryptic when it comes to describing the packages it provides (especially if you look at the Navigator). So I did a little digging, and under C:\Users\USER_NAME\anaconda3\envs\MachineLearning\tessdata (with MachineLearning being my custom virtual environment) I found two things:

  • There are only two .traineddata files - eng.traineddata and osd.traineddata
  • The eng.traineddata is much smaller (almost 10 times) than the one I found in the git repo of the Tesseract project, hosted on GitHub.
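A check like the one I did by hand can be scripted. This is only a sketch using the standard library; the helper names `installed_languages` and `missing_languages` are hypothetical, and it assumes every language is stored as `<code>.traineddata` directly inside the tessdata folder.

```python
from pathlib import Path

def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))

def missing_languages(lang_spec, tessdata_dir):
    """Return the codes from a spec such as 'grc+eng' that have no .traineddata file."""
    available = set(installed_languages(tessdata_dir))
    return [code for code in lang_spec.split("+") if code not in available]
```

For a folder containing only eng.traineddata and osd.traineddata, `missing_languages('grc+eng', folder)` would report `grc` as missing.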

I downloaded multiple trained data files (eng, ell and grc). I did a test with just grc and ell (separate plus combined) and a test with both Greek and English in the image. For example the following image (after removing the line detection part of the code above)

enter image description here

yields the following result:

OUTPUT: ['Here’s', 'some', 'GBeek', 'Od10', 'd1ota', 'iumEedit', 'Oy']

I tried various values for the PSM parameter (that made sense of course) but nothing changed.

I am new to OCR and Tesseract so I'm probably missing something quite obvious.

Nadabas answered 1/11, 2020 at 13:56 Comment(0)

Your code is missing the line that tells pytesseract where your Tesseract executable is:

pytesseract.pytesseract.tesseract_cmd = 'D:/Tesseract-OCR/tesseract.exe'
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))
Candlemas answered 27/2, 2023 at 14:28 Comment(1)
I cannot test this right now (I am no longer working on this and found a workaround). However, isn't this too much? Shouldn't the library hide such details from the developer/user of the script? I could be working on Linux or macOS, which would force me to branch my code on this location and take into account the OS the script is running on at any given time. This is very tedious and error-prone. I would hope there is a better solution than this one. Nadabas
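On the portability concern: pytesseract's default `tesseract_cmd` is simply `tesseract`, so the line above is only needed when the executable is not on PATH. One way to avoid per-OS branching is to look the executable up at runtime. The following is a sketch, not a definitive solution; `find_tesseract` and the `TESSERACT_CMD` environment variable are names made up for this example.

```python
import os
import shutil

def find_tesseract(env_var="TESSERACT_CMD"):
    """Return a path to the tesseract executable, or None if it cannot be found.

    TESSERACT_CMD is a hypothetical variable name chosen for this sketch.
    """
    override = os.environ.get(env_var)
    if override:
        # An explicit override wins, so unusual install locations still work
        return override
    # shutil.which searches PATH the same way on Windows, Linux and macOS
    return shutil.which("tesseract")
```

If the result is not None, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` before calling `image_to_string`.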

Try using PNG images and this code:

from PIL import Image
import pytesseract

img_path = r'your image path'
# Tesseract expects the directory to be passed through the --tessdata-dir option
tessdata_dir_config = r'--tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"'
language = 'grc'

def process_image(image_name, lang_code, tessdata_dir_config):
    # Run OCR on the image with the given language and tessdata location
    return pytesseract.image_to_string(Image.open(image_name), lang=lang_code, config=tessdata_dir_config)

def print_data(data):
    print(data)

def output_file(filename, data):
    # Write as UTF-8 so Greek characters are saved correctly
    with open(filename, "w", encoding="utf-8") as file:
        file.write(data)

def main():
    data_gr = process_image(img_path, language, tessdata_dir_config)
    print_data(data_gr)
    #output_file('my_ocr', data_gr)

if __name__ == '__main__':
    main()
Praxiteles answered 23/10, 2022 at 8:56 Comment(1)
How does that answer the problem stated by the OP? What are the outputs? Won
