How to get hOCR output using python-tesseract

4

5

I had been getting really good results using pytesseract, but it does not preserve double spaces, and they are really important for me. So I decided to retrieve hOCR output rather than plain text. However, there doesn't appear to be any way of specifying a config file using pytesseract.

So, is it possible to specify a configuration file using pytesseract, or is there some default config file that I can change to get hOCR output?

# run_tesseract function from pytesseract.py
def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
    '''
    runs the command:
        `tesseract_cmd` `input_filename` `output_filename_base`

    returns the exit status of tesseract, as well as tesseract's stderr output

    '''
    command = [tesseract_cmd, input_filename, output_filename_base]

    if lang is not None:
        command += ['-l', lang]

    if boxes:
        command += ['batch.nochop', 'makebox']

    if config:
        command += shlex.split(config)
    #command+=['C:\\Program Files (x86)\\Tesseract-OCR\\tessdata\\configs\\hocr']
    #print "command:",command
    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE)
    return (proc.wait(), proc.stderr.read())
Stephainestephan answered 13/12, 2015 at 6:10 Comment(1)
You just need the new option "preserve_interword_spaces=1", so your final config would look like: custom_config = '--oem 1 --psm 11 -c preserve_interword_spaces=1'Bartholemy
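A minimal sketch of how the config string from the comment above could be assembled and passed to pytesseract. The helper names `build_tesseract_config` and `ocr_text_keeping_spaces` are hypothetical, and the sketch assumes a pytesseract release whose `image_to_string` accepts a `config` argument and a Tesseract version that supports `preserve_interword_spaces`.

```python
def build_tesseract_config(psm=11, oem=1, preserve_spaces=True):
    """Build the string handed to pytesseract's `config` argument.

    Hypothetical helper; the defaults mirror the values in the comment above.
    """
    parts = ["--oem", str(oem), "--psm", str(psm)]
    if preserve_spaces:
        # -c sets a single Tesseract variable on the command line
        parts += ["-c", "preserve_interword_spaces=1"]
    return " ".join(parts)


def ocr_text_keeping_spaces(image_path, psm=11):
    """Run OCR on image_path and return text, asking Tesseract to keep spacing."""
    # Imported lazily: pytesseract is third-party and needs the Tesseract binary.
    import pytesseract

    return pytesseract.image_to_string(
        image_path, config=build_tesseract_config(psm=psm)
    )
```

Calling `ocr_text_keeping_spaces('image.png')` would then return plain text; whether the spacing matches the page closely enough depends on the image and the page segmentation mode.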
M
6

You can use another library to call Tesseract from Python: pyslibtesseract

Code:

import pyslibtesseract

tesseract_config = pyslibtesseract.TesseractConfig(psm=pyslibtesseract.PageSegMode.PSM_SINGLE_LINE, hocr=True)
print(pyslibtesseract.LibTesseract.simple_read(tesseract_config, 'phrase0.png'))

Output:

  <div class='ocr_page' id='page_1' title='image ""; bbox 0 0 319 33; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 0 319 33">
    <p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 10 13 276 25">
     <span class='ocr_line' id='line_1_1' title="bbox 10 13 276 25; baseline 0 0"><span class='ocrx_word' id='word_1_1' title='bbox 10 14 41 25; x_wconf 75' lang='eng' dir='ltr'><strong>the</strong></span> <span class='ocrx_word' id='word_1_2' title='bbox 53 13 97 25; x_wconf 84' lang='eng' dir='ltr'><strong>book</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 111 13 129 25; x_wconf 79' lang='eng' dir='ltr'><strong>is</strong></span> <span class='ocrx_word' id='word_1_4' title='bbox 143 17 164 25; x_wconf 83' lang='eng' dir='ltr'>on</span> <span class='ocrx_word' id='word_1_5' title='bbox 178 14 209 25; x_wconf 75' lang='eng' dir='ltr'><strong>the</strong></span> <span class='ocrx_word' id='word_1_6' title='bbox 223 14 276 25; x_wconf 76' lang='eng' dir='ltr'><strong>table</strong></span> 
     </span>
    </p>
   </div>
  </div>
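Since the original goal was to recover spacing, here is a rough sketch of how the word boxes in hOCR like the output above could be read with the standard library only. The names `HocrWordParser`, `parse_hocr_words` and `horizontal_gaps` are made up for this illustration, and the sample string is a shortened copy of two words from the output above. It only looks at `ocrx_word` spans and their `bbox` values.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._depth = 0     # nesting depth of tags inside that word span
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested tag inside a word, e.g. <strong>
        elif tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 0
                self._chunks = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
        else:
            # closing tag of the word span itself
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def horizontal_gaps(words):
    """Pixel gap between the right edge of each word and the left edge of the next."""
    return [nxt[1][0] - cur[1][2] for cur, nxt in zip(words, words[1:])]


SAMPLE_HOCR = (
    "<span class='ocr_line' id='line_1_1' title=\"bbox 10 13 276 25; baseline 0 0\">"
    "<span class='ocrx_word' id='word_1_1' title='bbox 10 14 41 25; x_wconf 75'>"
    "<strong>the</strong></span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 53 13 97 25; x_wconf 84'>"
    "<strong>book</strong></span>"
    "</span>"
)

if __name__ == "__main__":
    words = parse_hocr_words(SAMPLE_HOCR)
    print(words)
    print(horizontal_gaps(words))
```

A gap noticeably wider than the typical gap on a line could then be treated as a double space; the threshold is something you would have to tune for your font and resolution.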
Madi answered 5/1, 2016 at 13:5 Comment(7)
I am getting OSError: dlopen(/usr/local/lib/python2.7/site-packages/pyslibtesseract/cppcode/libpyslibtesseract.so, 6): no suitable image found. Did find: /usr/local/lib/python2.7/site-packages/pyslibtesseract/cppcode/libpyslibtesseract.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00 :/Locarno
@InêsMartins You need to use Python 3.Madi
Thanks @Macabeus :/ Does it not work with Python 2.x?Locarno
@InêsMartins No, and Python 2 has another package to integrate with Tesseract.Madi
"Python 2 has another package to integrate with Tesseract": which package? Can you tell me, please, @Macabeus? Is it pytesseract?Locarno
I started a virtual environment to test with Python 3, but I am having trouble: I get "RuntimeError: pyslibtesseract.so was not generated!"Locarno
@InêsMartins Could you install the package from Github? github.com/brunomacabeusbr/pyslibtesseract MaybeMadi
E
5

This worked for me :)

from pytesseract import pytesseract
pytesseract.run_tesseract('image.png', 'output', lang=None, boxes=False, config="hocr")

Here, image.png is the image file located beside this Python file. An output file named output.hocr will be generated next to these files. Open it in a text editor to see the hOCR output.
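For newer pytesseract releases, where the internal `run_tesseract` signature differs, a sketch using the public `image_to_pdf_or_hocr` function may be easier to keep working. The wrapper name `image_to_hocr_file` and the default output path are assumptions made for this example.

```python
from pathlib import Path


def image_to_hocr_file(image_path, output_path="output.hocr", config=""):
    """Write the hOCR for image_path to output_path and return that path.

    Hypothetical wrapper around pytesseract.image_to_pdf_or_hocr.
    """
    # Imported lazily: pytesseract is third-party and needs the Tesseract binary.
    import pytesseract

    # With extension="hocr" the function returns the hOCR document as bytes.
    hocr_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), extension="hocr", config=config
    )
    output = Path(output_path)
    output.write_bytes(hocr_bytes)
    return output
```

Usage would be `image_to_hocr_file('image.png')`, after which output.hocr can be opened in a text editor as described above.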

Expurgate answered 29/2, 2016 at 12:10 Comment(1)
It tells me that there is no module run_tesseract. I can't even find proper documentation for pytesseract. What is wrong with these people?Toms
G
1

I've edited krozaine's answer:

import pytesseract

pytesseract.run_and_get_output('image.jpg', 'hocr', lang=None, config="hocr")
Germanism answered 5/5, 2022 at 13:24 Comment(0)
D
0

Just add hocr at the end of your command, like this:

tesseract input_filename output_filename_base hocr


The output file will be an HTML file.
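If you want to run that command from Python without pytesseract, something like the following sketch should work. It assumes a `tesseract` executable on your PATH; the function names `build_hocr_command` and `run_hocr` are hypothetical.

```python
import subprocess


def build_hocr_command(input_filename, output_filename_base, tesseract_cmd="tesseract"):
    """Return the argument list for the command shown above."""
    # "hocr" is a config file name, so it goes after the input and output base.
    return [tesseract_cmd, input_filename, output_filename_base, "hocr"]


def run_hocr(input_filename, output_filename_base):
    """Run Tesseract and return (exit status, stderr bytes)."""
    completed = subprocess.run(
        build_hocr_command(input_filename, output_filename_base),
        stderr=subprocess.PIPE,
        check=False,
    )
    return completed.returncode, completed.stderr
```

For example, `run_hocr('image.png', 'output')` writes the result next to the given output base; the file extension Tesseract chooses depends on its version.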

Declarant answered 15/12, 2015 at 6:53 Comment(0)
