Tesseract training for a new font
S

3

31

I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate on the images I was trying to extract text from. I came across Tesseract training, which is supposed to decrease the error rate for a specific font. I also found a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. From it I received a .traineddata file, and I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.

Stettin answered 23/12, 2016 at 5:13 Comment(0)
J
3

This might be a late response, but the question still shows up on Google.

Newer versions of Tesseract come shipped with a bunch of tools to make this really easy, without having to do manual work with a box editor.

text2image lets you generate both the .tif file and its respective .box file for use with tesstrain.

text2image \
    --font="Font Name" \
    --fonts_dir="Optional Fonts Dir" \
    --text=path/to/textfile \
    --outputbase=path/to/output \
    --max_pages=1 \
    --leading=32 \
    --xsize=3600 \
    --ysize=480 \
    --char_spacing=1.0 \
    --exposure=0 \
    --unicharset_file=path/to/unicharset

I believe the --unicharset_file parameter may be optional.
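If you would rather drive text2image from a script, here is a minimal Python sketch of the same call. The helper names (build_text2image_cmd, render_training_page) and the example file names are my own placeholders, not part of Tesseract. It assumes text2image is on your PATH and leaves out --unicharset_file. Passing the arguments as a list avoids shell quoting problems with font names that contain spaces.

```python
import shlex
import subprocess
from pathlib import Path


def build_text2image_cmd(font, text_file, output_base, fonts_dir=None):
    """Return the text2image argument list for one font (hypothetical helper)."""
    cmd = [
        "text2image",
        f"--font={font}",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        "--max_pages=1",
        "--leading=32",
        "--xsize=3600",
        "--ysize=480",
        "--char_spacing=1.0",
        "--exposure=0",
    ]
    # Only pass --fonts_dir when the caller supplies one
    if fonts_dir is not None:
        cmd.append(f"--fonts_dir={fonts_dir}")
    return cmd


def render_training_page(font, text_file, output_base, fonts_dir=None):
    """Run text2image and return the .tif and .box paths it is expected to write."""
    cmd = build_text2image_cmd(font, text_file, output_base, fonts_dir)
    subprocess.run(cmd, check=True)  # raises if text2image is missing or fails
    return Path(f"{output_base}.tif"), Path(f"{output_base}.box")


if __name__ == "__main__":
    # Print the command for inspection instead of running it
    print(shlex.join(build_text2image_cmd("Font Name", "training_text.txt", "out/page")))
```

Printing the command first is a cheap way to check the font name and paths before you actually render anything.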

Jones answered 20/2, 2024 at 9:33 Comment(1)
This is correct, install tesseract_ocr and the training library. - Pastille
S
19

You can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python, pass lang="Font" as the second parameter of the image_to_string function. It improves accuracy significantly, but it still makes mistakes of course. Or you can learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
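A rough Python sketch of those two steps, assuming pytesseract and Pillow are installed. install_traineddata and ocr_with_font are names I made up for illustration, and the tessdata directory is something you have to look up on your own system. The value passed as lang is simply the file name without the .traineddata extension.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy <name>.traineddata into tessdata_dir and return <name> for use as lang."""
    src = Path(traineddata_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    dest_dir = Path(tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dest_dir / src.name)
    return src.stem


def ocr_with_font(image_path, lang):
    """Run OCR with the custom model; needs pytesseract and Pillow installed."""
    import pytesseract  # imported lazily so the copy helper works without it
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

For example, after copying Font.traineddata you would call ocr_with_font("scan.png", "Font"). Copying into a system-wide tessdata folder may need elevated permissions.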

Stettin answered 27/12, 2016 at 1:41 Comment(3)
Hello, do you know how I can create font files for training? For example, if I have a couple of devices whose serial numbers I want to OCR, how do I create font files for them in order to train Tesseract? - Rachael
@Joshua, this question may help with that. Or just search. - Semifluid
tessdata on my system was in /usr/share/tesseract-ocr/VERSION/tessdata/ - Poisoning
P
1

If you want to train Tesseract on a new font, generate a .traineddata file for that font. To do so, you first need a .tiff file and a .box file, which you can create with jTessBoxEditor. A tutorial for jTessBoxEditor is here. While making the .tiff file you can set the font on which you want to train Tesseract. To generate the .traineddata file you can use either jTessBoxEditor or serak-tesseract-trainer. I have used both, and I would say jTessBoxEditor is great for generating the tiff and box files, while serak is the better choice for the training itself.
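Before feeding a .box file to a trainer, it can help to sanity-check it. Below is a small Python sketch that assumes the classic per-character box layout: a symbol followed by left, bottom, right, top and a page index, separated by spaces, with coordinates measured from the bottom-left of the image. BoxEntry, parse_box_line and check_boxes are illustrative names of my own. Other box formats, such as the line-level ones used for LSTM training, are not handled.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line: symbol, four pixel coordinates, page index."""
    # Split from the right so the symbol is whatever precedes the five numbers
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"malformed box line: {line!r}")
    left, bottom, right, top, page = map(int, numbers)
    return BoxEntry(symbol, left, bottom, right, top, page)


def check_boxes(lines):
    """Return the entries whose box has non-positive width or height."""
    entries = [parse_box_line(l) for l in lines if l.strip()]
    return [e for e in entries if e.right <= e.left or e.top <= e.bottom]
```

Anything check_boxes returns is probably worth opening in a box editor and fixing by hand.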

Perineum answered 13/3, 2019 at 6:9 Comment(0)
