I am trying to automatically extract a scale (scale bar + a number + unit) from an image. Here is an example:
It is used to map pixels to real world measurement.
I am using PyTesseract (installed through Anaconda3).
Here is my code:
import cv2
import pytesseract
import numpy as np
img = cv2.imread('pbmk_scale.tif')
#img = cv2.imread('ocr_test_greek_and_english.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening
# Line detection for the scale line
edges = cv2.Canny(gray,50,150,apertureSize = 3)
minLineLength = 100
maxLineGap = 10
lines = cv2.HoughLinesP(edges,1,np.pi/180,100,minLineLength,maxLineGap)
x1,y1,x2,y2 = lines[0][0]
print('Line (' + str(x1) + ',' + str(y1) + ') -- (' + str(x2) + ',' + str(y2) + ')')
# Calculating lenght of scale line in pixels. Since the line is always horizontal we need to just subtract the X coordinates
l = abs(x1 - x2)
print('Line is ' + str(l) + 'px long')
# Text recognition for the scale number and real unit
# FIXME Greek not detected. Is it grc or ell for the configuration? Both don't work
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))
text = pytesseract.image_to_string(img, config=custom_config)
print('OUTPUT:', text.split())
number = [int(s) for s in text.split() if s.isdigit()]
print('Number is ' + str(number))
So far it is working quite nicely especially since the image is generated through Helium Ion microscopy and the label (where the scale bar is positioned) is automatically generated and stored along with the image as TIFF. So detecting text and lines is spot on. In addition the scale bar is always in the same location in the image and the actual scale line is always horizontal. The code above has its flaws but I'm more interested in the fact that I am unable to detect anything but English.
Sadly Anaconda is quite cryptographic when it comes to the description of the packages it provides (especially if you look at the navigator). So I did a little digging and under C:\Users\USER_NAME\anaconda3\envs\MachineLearning\tessdata
(with MachineLearning
being my custom virtual environment) I found two things:
- There are only two
.traineddata
files -eng.traineddata
andosd.traineddata
- The
eng.traineddata
is much smaller (almost 10 times) than the one I found in the git repo of the Tesseract project, hosted on GitHub.
I downloaded multiple trained data files (eng, ell and grc). I did a test with just grc and ell (separate plus combined) and a test with both Greek and English in the image. For example the following image (after removing the line detection part of the code above)
yields the following result:
OUTPUT: ['Here’s', 'some', 'GBeek', 'Od10', 'd1ota', 'iumEedit', 'Oy']
I tried various values for the PSM parameter (that made sense of course) but nothing changed.
I am new to OCR and Tesseract so I'm probably missing something quite obvious.