How to get hOCR output using python-tesseract

Asked 13/12, 2015 at 6:10 Answered 5/5, 2022 at 13:24

I had been getting really good results using pytesseract, but it does not preserve double spaces, and they are really important for me. So I decided to retrieve hOCR output rather than plain text. However, there doesn't appear to be any way of specifying a config file using pytesseract.

So, is it possible to specify a configuration file using pytesseract, or is there some default config file that I can change to get hOCR output?

# run_tesseract function from pytesseract.py
def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
    '''
    runs the command:
        `tesseract_cmd` `input_filename` `output_filename_base`

    returns the exit status of tesseract, as well as tesseract's stderr output

    '''
    command = [tesseract_cmd, input_filename, output_filename_base]

    if lang is not None:
        command += ['-l', lang]

    if boxes:
        command += ['batch.nochop', 'makebox']

    if config:
        command += shlex.split(config)
    #command+=['C:\\Program Files (x86)\\Tesseract-OCR\\tessdata\\configs\\hocr']
    #print "command:",command
    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE)
    return (proc.wait(), proc.stderr.read())

Stephainestephan answered 13/12, 2015 at 6:10 Comment(1)

You just need the newer option "preserve_interword_spaces=1", so your final config would look like: custom_config = '--oem 1 --psm 11 -c preserve_interword_spaces=1' – Bartholemy 11/10, 2020 at 9:47
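As a rough sketch of how that config string might be passed along: the helper names `build_config` and `ocr_text` below are made up for illustration, and the code assumes a pytesseract release whose `image_to_string` accepts a `config` argument, plus a Tesseract build that supports `preserve_interword_spaces`.

```python
def build_config(oem=1, psm=11, preserve_spaces=True):
    """Assemble a Tesseract config string (hypothetical helper)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if preserve_spaces:
        # Ask Tesseract to keep runs of spaces between words.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def ocr_text(image_path, config):
    """Run OCR on an image file and return plain text (hypothetical helper)."""
    # Imported here so the config helper works even without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image_path, config=config)
```

Something like `ocr_text('image.png', build_config())` would then return the recognized text; whether the double spaces survive depends on the Tesseract version and the image.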

You can use another library to call Tesseract from Python: pyslibtesseract

Code:

import pyslibtesseract

tesseract_config = pyslibtesseract.TesseractConfig(psm=pyslibtesseract.PageSegMode.PSM_SINGLE_LINE, hocr=True)
print(pyslibtesseract.LibTesseract.simple_read(tesseract_config, 'phrase0.png'))

Output:

  <div class='ocr_page' id='page_1' title='image ""; bbox 0 0 319 33; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 0 319 33">
    <p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 10 13 276 25">
     <span class='ocr_line' id='line_1_1' title="bbox 10 13 276 25; baseline 0 0"><span class='ocrx_word' id='word_1_1' title='bbox 10 14 41 25; x_wconf 75' lang='eng' dir='ltr'><strong>the</strong></span> <span class='ocrx_word' id='word_1_2' title='bbox 53 13 97 25; x_wconf 84' lang='eng' dir='ltr'><strong>book</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 111 13 129 25; x_wconf 79' lang='eng' dir='ltr'><strong>is</strong></span> <span class='ocrx_word' id='word_1_4' title='bbox 143 17 164 25; x_wconf 83' lang='eng' dir='ltr'>on</span> <span class='ocrx_word' id='word_1_5' title='bbox 178 14 209 25; x_wconf 75' lang='eng' dir='ltr'><strong>the</strong></span> <span class='ocrx_word' id='word_1_6' title='bbox 223 14 276 25; x_wconf 76' lang='eng' dir='ltr'><strong>table</strong></span> 
     </span>
    </p>
   </div>
  </div>
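Once you have hOCR like the above, the word text and bounding boxes can be pulled out with the standard library alone. This is only a minimal sketch: `HocrWordParser` and `parse_hocr_words` are hypothetical names, and it assumes each word sits in a `span` with class `ocrx_word` whose `title` carries `bbox` and `x_wconf`, as in the output shown.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word
        self._depth = 0       # <span> nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1
        elif "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(self._current)
                self._current = None

    def handle_data(self, data):
        # Text inside nested tags such as <strong> is still part of the word.
        if self._current is not None:
            self._current["text"] += data


def parse_hocr_words(hocr_text):
    """Return a list of dicts with 'text', 'bbox' and 'conf' keys."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

The horizontal gap between one word's right edge and the next word's left edge (`bbox[2]` and `bbox[0]`) could then be used to guess where wider spacing occurred, though any threshold for that would have to be tuned per document.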

Madi answered 5/1, 2016 at 13:5 Comment(7)

I am getting OSError: dlopen(/usr/local/lib/python2.7/site-packages/pyslibtesseract/cppcode/libpyslibtesseract.so, 6): no suitable image found. Did find: /usr/local/lib/python2.7/site-packages/pyslibtesseract/cppcode/libpyslibtesseract.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00 :/ – Locarno 18/2, 2016 at 10:3

@InêsMartins You need to use Python 3 – Madi 18/2, 2016 at 18:0

thanks @Macabeus :/ it does not work with python 2.x? – Locarno 18/2, 2016 at 18:15

@InêsMartins No, and python2 have another package to integrate with tesseract – Madi 18/2, 2016 at 20:53

"python2 have another package to integrate with tesseract", what package? Can you tell me pls @Macabeus? It is pytesseract? – Locarno 12/5, 2016 at 9:56

I started a virtual environment to test with python3 but I am having trouble, getting "RuntimeError: pyslibtesseract.so was not generated!" – Locarno 12/5, 2016 at 15:13

@InêsMartins Could you install the package from Github? github.com/brunomacabeusbr/pyslibtesseract Maybe – Madi 13/5, 2016 at 1:46

This worked for me :)

from pytesseract import pytesseract
pytesseract.run_tesseract('image.png', 'output', lang=None, boxes=False, config="hocr")

Here, image.png is an image file in the same directory as this Python script. An output file named output.hocr will be generated next to it. Open that file in a text editor to see the hOCR output.

Expurgate answered 29/2, 2016 at 12:10 Comment(1)

it tells me that there is no module run_tesseract I can't even find a proper documentation of pytesseract.. what is wrong with these ppl? – Toms 16/4, 2018 at 16:44

I've edited krozaine's answer:

import pytesseract

pytesseract.run_and_get_output('image.jpg', 'hocr', lang=None, config="hocr")
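For readers on a recent pytesseract release, the public function `image_to_pdf_or_hocr` may be a more stable route than the internal helpers above. The sketch below assumes that function is available in your installed version and that the `tesseract` binary is on the PATH; the wrapper names `image_to_hocr` and `save_hocr` are hypothetical.

```python
from pathlib import Path


def image_to_hocr(image_path, config=""):
    """Return hOCR markup for an image file as a str (hypothetical wrapper)."""
    # Imported here so this module loads even if pytesseract is missing.
    import pytesseract

    # image_to_pdf_or_hocr returns bytes; extension selects hOCR over PDF.
    raw = pytesseract.image_to_pdf_or_hocr(
        image_path, extension="hocr", config=config
    )
    return raw.decode("utf-8")


def save_hocr(image_path, output_path, config=""):
    """Write the hOCR markup for an image to output_path (hypothetical wrapper)."""
    Path(output_path).write_text(image_to_hocr(image_path, config), encoding="utf-8")
```

For example, `save_hocr('image.jpg', 'output.hocr', config='-c preserve_interword_spaces=1')` should leave the markup in output.hocr next to the script.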
