Tesseract OCR force pattern

I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?

I have tried the bazaar pattern matching in Tesseract with the pattern \A\A\d\d\d, but the OCR still recognizes other words that don't match.

I have also tried the "tessedit_char_whitelist" parameter, but it does not let me constrain the position of the characters.

  • I run the command tesseract image.jpg result -l eng bazaar and get this message:

Please provide at least 4 concrete characters at the beginning of the pattern

Invalid user pattern \A\A\d\d\d

Tesseract Open Source OCR Engine v3.01 with Leptonica

  • image.jpg :

[image]

  • The result :

      AB123
      ABC12
      A1234
      12345
      ABCD1
    

So the result is wrong: I only wanted to catch the sequence "AB123".

Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I strictly followed the bazaar tutorial.

Paraphernalia answered 7/8, 2015 at 9:33 Comment(7)
I believe the error "Please provide at least 4 concrete characters at the beginning of the pattern" pretty much explains itself. This is probably a limitation of whatever you are using. Also try \w\w\d\d\d; \A is not what you want for all "characters". Try it here. - Sedberry
I tried \w\w\d\d\d and got the same error: Please provide at least 4 concrete characters at the beginning of the pattern Invalid user pattern \w\w\d\d\d. - Paraphernalia
I added 4 concrete characters to my pattern, TEST\w\w\d\d\d, and tested with the words TESTAB123, TESTABC12, etc. The error Please provide at least 4 concrete characters at the beginning of the pattern is gone, but I still get Invalid user pattern TEST\w\w\d\d\d. I don't understand why it is invalid. - Paraphernalia
Because \w\w is not recognized by Tesseract. I tried \c\c instead and the error message is gone. But the result is still wrong; it is as if Tesseract totally ignores the regex... - Paraphernalia
Did you try [A-Z][A-Z][0-9][0-9][0-9]? Did you define it in /path/to/eng.user-patterns? Does /path/to/configs/bazaar contain user_patterns_suffix user-patterns? Just guessing... - Bulb
Yes and yes. The result is the same. There is no error, it just does nothing. I'm on Windows 8, by the way, and I am editing the file with Unix line endings <LF> in Notepad2. - Paraphernalia
This feature most probably doesn't work anymore. github.com/tesseract-ocr/tesseract/issues/960 - Priestley

If you add the option --oem 0 (OCR engine mode 0, legacy Tesseract engine only), the --user-patterns option is properly enforced. See this PR comment.

For a detailed example, you can read this answer.
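
As a rough sketch of how such a run could be put together, the Python snippet below writes a patterns file and builds the command line. The helper names, the file name eng.user-patterns and the pattern \A\A\d\d\d are assumptions for illustration only. Whether that pattern is accepted depends on your Tesseract version, and --oem 0 needs a traineddata file that still contains the legacy model.

```python
import os
import tempfile


def write_patterns(path, patterns):
    """Write one pattern per line with LF line endings (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for pattern in patterns:
            handle.write(pattern + "\n")


def build_tesseract_command(image, output_base, patterns_file, lang="eng"):
    """Return the argument list for a legacy-engine run with user patterns."""
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--oem", "0",  # legacy engine only, as suggested above
        "--user-patterns", str(patterns_file),
    ]


# Example usage: the paths and the pattern are placeholders, not verified values.
workdir = tempfile.mkdtemp()
patterns_path = os.path.join(workdir, "eng.user-patterns")
write_patterns(patterns_path, [r"\A\A\d\d\d"])
command = build_tesseract_command("image.jpg", "result", patterns_path)
print(" ".join(command))
# To actually run it, pass `command` to subprocess.run(command, check=True).
```

Building the arguments as a list avoids shell quoting problems with the backslashes in the file path on Windows.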

Shown answered 13/2, 2024 at 9:35 Comment(0)

Try using this pattern with quantifiers instead.

[a-zA-Z]{2}\d{3}

This should match exactly 2 alphabetic characters followed by 3 digits.

The reason you were matching everything before is that \w matches any alphanumeric character.
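
To see what this expression accepts, here is a minimal sketch that applies it to the OCR result from the question using Python's re module. Note that this filters the text after OCR rather than constraining Tesseract itself, and the function name filter_matches is made up for the example.

```python
import re

# Exactly two ASCII letters followed by exactly three digits.
PATTERN = re.compile(r"[a-zA-Z]{2}\d{3}")


def filter_matches(ocr_text):
    """Keep only the lines that match the whole pattern (hypothetical helper)."""
    kept = []
    for line in ocr_text.splitlines():
        candidate = line.strip()
        if PATTERN.fullmatch(candidate):
            kept.append(candidate)
    return kept


# The five lines Tesseract returned in the question.
ocr_output = """AB123
ABC12
A1234
12345
ABCD1
"""

print(filter_matches(ocr_output))  # ['AB123']
```

fullmatch is used instead of search so that a longer string merely containing two letters and three digits is not accepted.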

Patroclus answered 11/8, 2019 at 10:20 Comment(0)
