I had been getting really good results using pytesseract but it is not able to preserve double spaces and they are really important for me. And, so i decided to retrieve hocr output rather than pure text.But;there doesn't appear to be any way of specifying config file using pytessearct.
So, is it possible to specify cofiguration file using pytesseract or is there some default config file that i can change to get hocr output?
#run method from pytessearct.py
def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
'''
runs the command:
`tesseract_cmd` `input_filename` `output_filename_base`
returns the exit status of tesseract, as well as tesseract's stderr output
'''
command = [tesseract_cmd, input_filename, output_filename_base]
if lang is not None:
command += ['-l', lang]
if boxes:
command += ['batch.nochop', 'makebox']
if config:
command += shlex.split(config)
#command+=['C:\\Program Files (x86)\\Tesseract-OCR\\tessdata\\configs\\hocr']
#print "command:",command
proc = subprocess.Popen(command,
stderr=subprocess.PIPE)
return (proc.wait(), proc.stderr.read())