Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. It takes hundreds of thousands of samples of training data to get accuracy, so using a good starting point lets you fine-tune your training with much less data (~tens to hundreds of samples can be enough)
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png
that is a picture of the text foobar
and you should have a text file named 001.gt.txt
that has the text foobar
.
These files need to be single lines of text.
In the tesstrain
repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -
a_1.gt.txt
,a_2.g.txt
etc with content "a", with imagesa_1.png
,a_2.png
etc – Lordly