How do I train Tesseract 4 with image data instead of a font file?

Asked 11/4, 2017 at 17:47 Answered 24/10, 2023 at 4:25

I'm trying to train Tesseract 4 with images instead of fonts.

The docs only explain the font-based approach, not training from images.

I know how to do this with earlier versions of Tesseract, but I can't work out how to use the box/tiff files for LSTM training in Tesseract 4.

I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?

Erysipelas answered 11/4, 2017 at 17:47 Comment(0)

Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.

You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. Its models act as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with much less data (tens to hundreds of samples can be enough).

Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth

Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar and you should have a text file named 001.gt.txt that has the text foobar.

Each image, and its matching text file, must contain a single line of text.
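Before training, it can help to check the ground-truth directory for unpaired or multi-line files. The following is only a sketch: the function name check_ground_truth is made up for illustration, and it assumes plain .png or .tif images with a .gt.txt transcription alongside each one.

```shell
# Report ground-truth problems in a directory of image/text pairs.
# Usage: check_ground_truth DIR   (hypothetical helper, not part of tesstrain)
check_ground_truth() {
  local dir="$1" problems=0 img base gt lines
  for img in "$dir"/*.png "$dir"/*.tif; do
    [ -e "$img" ] || continue            # skip glob patterns that matched nothing
    base="${img%.*}"                     # strip the image extension
    gt="$base.gt.txt"
    if [ ! -f "$gt" ]; then
      echo "missing transcription: $gt"
      problems=$((problems + 1))
      continue
    fi
    # Count non-blank lines; exactly one is expected per transcription.
    lines=$(grep -c '[^[:space:]]' "$gt" || true)
    if [ "$lines" -ne 1 ]; then
      echo "expected 1 line of text, found $lines: $gt"
      problems=$((problems + 1))
    fi
  done
  echo "$problems problem(s) found"
  [ "$problems" -eq 0 ]                  # non-zero exit status if anything is wrong
}
```

For example, check_ground_truth data/my-custom-model-ground-truth lists every image without a transcription and every transcription that is empty or spans several lines.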

In the tesstrain repo, run this command:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best

Once the training is complete, there will be a new file tesstrain/data/my-custom-model.traineddata (named after the MODEL_NAME you passed to make). Copy that file to the directory where Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.

Then, you can run tesseract and use that model as a language.

tesseract -l my-custom-model foo.png -
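If you would rather not copy the model into the system directory, Tesseract's --tessdata-dir option can point at the training output instead. A minimal sketch, assuming the model sits in ./data as described above; the wrapper name ocr_with_model is hypothetical:

```shell
# Run OCR with a model taken straight from a directory of .traineddata files.
# Usage: ocr_with_model MODEL IMAGE [TESSDATA_DIR]   (hypothetical helper)
ocr_with_model() {
  local model="$1" image="$2" dir="${3:-data}"
  if [ ! -f "$dir/$model.traineddata" ]; then
    echo "no such model: $dir/$model.traineddata" >&2
    return 1
  fi
  # "-" as the output base prints the recognised text to stdout.
  tesseract "$image" - --tessdata-dir "$dir" -l "$model"
}
```

Running ocr_with_model my-custom-model foo.png from the tesstrain directory should then behave like the command above, without touching /usr/local/share/tessdata/.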

Febri answered 28/4, 2020 at 13:17 Comment(5)

Hey, thanks for this answer. A question: I have about 200 PNGs for each letter of the alphabet, so should I create the text files as a_1.gt.txt, a_2.gt.txt, etc. with content "a", with images a_1.png, a_2.png, etc.? – Lordly 11/7, 2020 at 7:33

The file name can be anything; all that matters is that the .gt.txt file and the .png file share the same name. a_1.gt.txt, a_1.png, a_2.gt.txt, a_2.png is correct. – Mintun 12/7, 2020 at 11:14
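With a naming scheme like a_1.png, a_2.png, the text files can be generated rather than written by hand. This is only a sketch under that assumption (the label is whatever precedes the first underscore); the function name label_from_filenames is invented for illustration.

```shell
# Write NAME.gt.txt for each NAME.png in DIR, taking the label from the part
# of the file name before the first underscore (a_1.png -> "a").
# Usage: label_from_filenames DIR   (hypothetical helper)
label_from_filenames() {
  local dir="$1" img name
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue            # nothing to do if no PNGs are present
    name=$(basename "$img" .png)
    printf '%s\n' "${name%%_*}" > "$dir/$name.gt.txt"
  done
}
```

Note that this overwrites any existing .gt.txt file with the same name, so run it only on a freshly prepared directory.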

Thanks for the answer! Does this allow adding new symbols? Is there a tool that helps creating those image/text files - for example by allowing one to supply a page as image, which should generate the line images and first guesses of the text? – Chairman 31/12, 2020 at 14:52

I don't know of anything that turns a page into images of lines. But if you get that far, a tip for quickly reviewing and cleaning the first guesses is a program called feh. feh lets you view an image and a caption at the same time and lets you edit the caption from within feh. github.com/eihli/image-table-ocr/blob/… – Febri 31/12, 2020 at 21:4

hocr-extract-images from "hocr-tools" will convert a .hocr file (generated by Tesseract) plus the image to a set of line images/text pairs. – Ternate 26/7, 2021 at 3:59

I recently wrote a website, http://www.tesstrain.com/, to perform Tesseract OCR training on numbers in screenshot bitmaps. It will also work for letters. It runs on free-tier cloud services.

Create an account, then:

Upload the images and edit the related text.
Run the Tesseract training.
Download the training result.

Brathwaite answered 24/10, 2023 at 4:25 Comment(0)
