How do I segment a document using Tesseract then output the resulting bounding boxes and labels

Asked 18/2, 2015 at 18:27 Answered 3/3, 2020 at 8:42

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants had to segment and various documents (academic paper here). Here's an example from that paper illustrating what I want to create: Image of segmented and labelled output

I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.

tesseract infile.tiff outfile_stem -l eng -psm 1 hocr

gives a bounding box for everything and has some labelling in class tags e.g.

<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
    <span class='ocr_line' id='line_5_142' ...

but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?

The current head version details:

tesseract 3.04.00
 leptonica-1.71
  libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5

Edit

I'm really looking to achieve this using the command line tool (as in examples above). @nguyenq has pointed me to the API reference, unfortunately I have no c++ experience. If the only solution is to use the API, please can you provide a quick python example?

Defeasible answered 18/2, 2015 at 18:27 Comment(0)

Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.

Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.

Overview

Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
Use the PVT to view the original image with the PAGE xml information overlaid

Code

brew install wine  # takes a little while >10m
brew install gs    # only for generating a tif example. Not required, you can use Preview
brew install wget  # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be ommitted and you can do the conversion to tiff with Preview
gs                          \
  -o paper-%d.tif           \
  -sDEVICE=tiff24nc         \
  -r300x300                 \
   paper.pdf 

cd ~/Downloads
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"   \
  -inp-img "$dl/Downloads/paper-3.tif"           \
  -out-xml "$dl/Downloads/paper-3-tool.xml"      \
  -rec-mode layout>>log.txt

# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
dl=~
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"

Results

Document with overlays (rollover to see text and type) Doc with overlays Overlays alone (use GUI buttons to toggle)

Appendix

You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!

# Note that the pvtool does take as input HOCR xml but it ignores the region type
brew install tesseract --devel  # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv paper-3.hocr paper-3.xml  # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"

At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:

pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST

Unfortunately, I kept getting null pointers.

Could not convert to target XML schema format.
java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
    at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)

Defeasible answered 21/2, 2015 at 0:31 Comment(0)

You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.

Unilobed answered 19/2, 2015 at 2:27 Comment(2)

Thanks for your quick answer. Is there no way to do this using the command line tool? – Defeasible 19/2, 2015 at 9:30

The hocr produced by the command-line gives you the word-level resolution. Other than that, you will have to go against the API. – Unilobed 20/2, 2015 at 2:23

If you are python familiar, you can directly use tesserocr library which is a nice python wrapper around the C++ API. Here is a code snippet to draw polygons at block level using PIL:

from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM

img = Image.open(filename)

results = []
with PyTessBaseAPI() as api:
    api.SetImage(img)
    api.SetPageSegMode(PSM.AUTO_ONLY)
    iterator = api.AnalyseLayout()
    for w in iterate_level(iterator, RIL.BLOCK):
        if w is not None:
            results.append((w.BlockType(), w.BlockPolygon()))
print('Found {} block elements.'.format(len(results)))

draw = ImageDraw.Draw(img)
for block_type, poly in results:
    # you can define a color per block type (see tesserocr.PT for block types list)
    draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)

Calumet answered 3/3, 2020 at 8:42 Comment(0)

With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.

More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs

Quincey answered 4/12, 2017 at 17:18 Comment(0)

Shortcut

It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.

Counterrevolution answered 23/2, 2015 at 9:42 Comment(2)

It's hidden away, but I mention this in the 'Appendix' of my answer. Opening HOCR straight from tesseract shows a file with only 'paragraph' regions, i.e. region types are ignored. Is this expected? – Defeasible 23/2, 2015 at 14:21

I'm not managing to get this to work. Whether I open an out.hocr or an out.xml, I get this message from PageViewer: An XML loading error occured. Please ensure XML validity and try again. (I produced the out.xml by renaming out.xml--should I be doing something different?) – Tantalous 12/8, 2019 at 20:1

The HOCR individual character step is now available in Tesseract since 4.1. Once the installation check, use :

tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

Crystie answered 31/7, 2017 at 10:8 Comment(0)

Edit

Overview

Code

Results

Appendix

Recommended topics

Hot tags