Tesseract OCR force pattern


Asked 7/8, 2015 at 9:33 Answered 13/2, 2024 at 9:35

I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?

I have tried the bazaar pattern matching in Tesseract with the pattern \d\d\d\A\A, but the OCR still recognizes other words that don't match.

I have also tried the "tessedit_char_whitelist" parameter, but it does not let me choose the position of the characters.

I launch the command: tesseract image.jpg result -l eng bazaar and I get this message:

Please provide at least 4 concrete characters at the beginning of the pattern

Invalid user pattern \A\A\d\d\d

Tesseract Open Source OCR Engine v3.01 with Leptonica

image.jpg :

The result :

  AB123
  ABC12
  A1234
  12345
  ABCD1

So it is wrong: I just wanted to catch the sequence "AB123".

Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I have strictly followed the bazaar tutorial.

Paraphernalia answered 7/8, 2015 at 9:33 Comment(7)

I believe this error: Please provide at least 4 concrete characters at the beginning of the pattern pretty much explains itself. This is probably a limitation of whatever tool you are using. Also try \w\w\d\d\d; \A is not what you want for all "characters". Try it here. – Sedberry 7/8, 2015 at 9:42

I tried \w\w\d\d\d and I get the same error: Please provide at least 4 concrete characters at the beginning of the pattern Invalid user pattern \w\w\d\d\d. – Paraphernalia 7/8, 2015 at 9:52

I have added 4 concrete characters to my pattern: TEST\w\w\d\d\d, and tested with the words TESTAB123, TESTABC12, etc. I no longer get the error Please provide at least 4 concrete characters at the beginning of the pattern, but I still get Invalid user pattern TEST\w\w\d\d\d. I don't understand why it is invalid. – Paraphernalia 7/8, 2015 at 10:11

It is because \w\w is not recognized by Tesseract. I tried \c\c instead and the error message is gone. But the result is still wrong; it is as if Tesseract totally ignores the regex... – Paraphernalia 7/8, 2015 at 10:14

Did you try [A-Z][A-Z][0-9][0-9][0-9]? Did you define it in /path/to/eng.user-patterns? Does /path/to/configs/bazaar contain user_patterns_suffix user-patterns? Just guessing... – Bulb 7/8, 2015 at 12:33

Yes and yes. The result is the same. There is no error, it just does nothing. I'm on Windows 8, by the way, and I am editing the file with Unix line endings <LF> in Notepad2. – Paraphernalia 7/8, 2015 at 13:0

This feature most probably doesn't work anymore. github.com/tesseract-ocr/tesseract/issues/960 – Priestley 19/4, 2018 at 9:32

If you add the option --oem 0 (OCR engine mode 0, the legacy Tesseract engine only), the --user-patterns option is properly enforced. See this PR comment.

For a detailed example, you can read this answer.
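As a rough sketch of what such an invocation could look like from Python, the helper below only assembles the command line with the two options mentioned above. The function names and the file names (image.jpg, result, eng.user-patterns) are made up for illustration. It also assumes a Tesseract build recent enough to accept --oem and --user-patterns, and a traineddata file that still contains the legacy engine model.

```python
import subprocess


def build_tesseract_command(image, output_base, patterns_file, lang="eng"):
    """Assemble a tesseract command line that selects the legacy engine
    (--oem 0) and passes a user-patterns file.

    All file names are supplied by the caller; nothing is executed here.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),
        "-l", lang,
        "--oem", "0",                      # legacy engine only
        "--user-patterns", str(patterns_file),
    ]


def run_tesseract(image, output_base, patterns_file, lang="eng"):
    """Run the command built above; raises CalledProcessError on failure."""
    cmd = build_tesseract_command(image, output_base, patterns_file, lang)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Hypothetical usage (requires tesseract on PATH and the files to exist):
# run_tesseract("image.jpg", "result", "eng.user-patterns")
```

Whether a given pattern line such as \A\A\d\d\d is accepted still depends on the Tesseract version, as the error messages in the question show.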

Shown answered 13/2, 2024 at 9:35 Comment(0)

-1

Try using this pattern with quantifiers instead.

[a-zA-Z]{2}\d{3}

This should match exactly 2 alphabetic characters followed by 3 digits.

The reason you were matching everything before is that \w matches any alphanumeric character.
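Note that this is ordinary regex syntax, and the thread above suggests Tesseract's user-patterns file does not accept it directly. One way to still use it, sketched here as an assumption rather than as a Tesseract feature, is to filter the OCR output text after recognition. The function name filter_matches is made up for this example, and the sample text is the result shown in the question.

```python
import re

# Two letters followed by three digits, matched against the whole line.
PATTERN = re.compile(r"[a-zA-Z]{2}\d{3}")


def filter_matches(ocr_text):
    """Return the lines of OCR output that fully match PATTERN."""
    matches = []
    for line in ocr_text.splitlines():
        candidate = line.strip()
        if PATTERN.fullmatch(candidate):
            matches.append(candidate)
    return matches


sample = """
  AB123
  ABC12
  A1234
  12345
  ABCD1
"""

print(filter_matches(sample))  # prints ['AB123']
```

This does not make the recognition itself any more accurate; it only discards lines that do not have the expected shape.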

Patroclus answered 11/8, 2019 at 10:20 Comment(0)
