Convert hOCR to HTML table

Asked 24/6, 2015 at 14:45 Answered 6/3, 2017 at 18:18

I am looking for a tool, or an approach I could implement in Python, that converts an hOCR file (generated by Tesseract in my application) to an HTML table. The idea is to use the text location information in the hOCR file (provided in the bbox attribute) to build a table that reflects those locations. Here is an example that illustrates the idea:

I used this image from SlideShare.net as input to my application, which uses Tesseract, and got the hOCR/XML file below as output.

hOCR file:

  <div class='ocr_page' id='page_2' title='image "sample_slide.jpg"; bbox 0 0 638 479; ppageno 1'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 0 638 479">
    <p class='ocr_par' dir='ltr' id='par_1' title="bbox 31 104 620 439">
     <span class='ocr_line' id='line_1' title="bbox 32 104 613 138"><span class='ocrx_word' id='word_1' title="bbox 32 105 119 131">done:</span> <span class='ocrx_word' id='word_2' title="bbox 132 104 262 138">working</span> <span class='ocrx_word' id='word_3' title="bbox 273 105 405 138">product,</span> <span class='ocrx_word' id='word_4' title="bbox 419 104 517 132">hotels</span> <span class='ocrx_word' id='word_5' title="bbox 528 104 613 132">listed</span> 
     </span>
     <span class='ocr_line' id='line_2' title="bbox 31 160 471 194"><span class='ocrx_word' id='word_6' title="bbox 31 164 62 187">to</span> <span class='ocrx_word' id='word_7' title="bbox 75 161 122 187">do:</span> <span class='ocrx_word' id='word_8' title="bbox 134 164 227 187">smart</span> <span class='ocrx_word' id='word_9' title="bbox 236 160 330 187">trafﬁc</span> <span class='ocrx_word' id='word_10' title="bbox 342 160 471 194">building</span> 
     </span>
     <span class='ocr_line' id='line_3' title="bbox 32 243 284 280"><span class='ocrx_word' id='word_11' title="bbox 32 243 128 280">seed</span> <span class='ocrx_word' id='word_12' title="bbox 148 243 284 280">round:</span> 
     </span>
     <span class='ocr_line' id='line_4' title="bbox 71 316 619 361"><span class='ocrx_word' id='word_13' title="bbox 71 321 156 356">CEO</span> <span class='ocrx_word' id='word_14' title="bbox 171 319 240 355">will</span> <span class='ocrx_word' id='word_15' title="bbox 260 321 384 356">invest</span> <span class='ocrx_word' id='word_16' title="bbox 517 316 619 361">$30k</span> 
     </span>
     <span class='ocr_line' id='line_5' title="bbox 75 392 620 439"><span class='ocrx_word' id='word_17' title="bbox 75 397 252 433">investor</span> <span class='ocrx_word' id='word_18' title="bbox 489 392 620 439">$120k</span> 
     </span>
    </p>
   </div>
  </div>

What I need is to convert the hOCR file to an HTML table based on the location of the text. The intended table should look something like this table.

The size and location of the table cells reflect the information provided in the hOCR file.

Image source: slideshare.net
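The idea described above, deriving table cells from the bbox values, could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not an existing tool: the names `HocrWords`, `line_to_cells`, `hocr_to_table` and the `max_gap` threshold of 40 pixels are all assumptions made for this example. Each `ocr_line` becomes a table row, and neighbouring words are merged into one cell unless the horizontal gap between them is large. Aligning cells into shared columns across rows is not attempted; short rows simply span the remaining columns.

```python
import re
from html import escape
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for each ocrx_word, grouped by ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one list of word tuples per ocr_line
        self._bbox = None  # bbox of the word whose text is expected next

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        cls = attributes.get("class", "")
        if cls == "ocr_line":
            self.lines.append([])
        elif cls == "ocrx_word":
            match = BBOX_RE.search(attributes.get("title", ""))
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_data(self, data):
        if self._bbox and data.strip() and self.lines:
            self.lines[-1].append(self._bbox + (data.strip(),))
            self._bbox = None


def line_to_cells(words, max_gap=40):
    """Merge neighbouring words into one cell unless the gap exceeds max_gap pixels."""
    cells = []
    prev_x1 = None
    for x0, _y0, x1, _y1, text in sorted(words):
        if prev_x1 is not None and x0 - prev_x1 <= max_gap:
            cells[-1].append(text)
        else:
            cells.append([text])
        prev_x1 = x1
    return [" ".join(cell) for cell in cells]


def hocr_to_table(hocr, max_gap=40):
    """Build a simple HTML table: one row per OCR line, one cell per word group."""
    parser = HocrWords()
    parser.feed(hocr)
    rows = [line_to_cells(words, max_gap) for words in parser.lines if words]
    ncols = max((len(row) for row in rows), default=1)
    out = ["<table>"]
    for row in rows:
        tds = []
        for i, text in enumerate(row):
            # The last cell of a short row spans the remaining columns.
            span = ncols - len(row) + 1 if i == len(row) - 1 else 1
            attr = f' colspan="{span}"' if span > 1 else ""
            tds.append(f"<td{attr}>{escape(text)}</td>")
        out.append("  <tr>" + "".join(tds) + "</tr>")
    out.append("</table>")
    return "\n".join(out)


# Two lines taken from the hOCR above (ids omitted for brevity)
sample = """
<span class='ocr_line' title="bbox 32 243 284 280"><span class='ocrx_word' title="bbox 32 243 128 280">seed</span> <span class='ocrx_word' title="bbox 148 243 284 280">round:</span></span>
<span class='ocr_line' title="bbox 75 392 620 439"><span class='ocrx_word' title="bbox 75 397 252 433">investor</span> <span class='ocrx_word' title="bbox 489 392 620 439">$120k</span></span>
"""
print(hocr_to_table(sample))
```

With the gap threshold assumed here, "seed round:" ends up in a single cell spanning both columns, while "investor" and "$120k" land in separate cells. A more faithful table would also need column alignment across rows and cell sizes taken from the bbox widths and heights.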

Sulfate answered 24/6, 2015 at 14:45 Comment(1)

github.com/ultrasaurus/hocr-javascript – Jannajannel 5/8, 2016 at 9:1

Check this document. I believe it describes much (or all) of what you need. From the introduction:

This document describes a representation of various aspects of OCR output in an XML-like format. That is, we define as set of tags containing text and other tags, together with attributes of those tags. However, since the content we are representing is formatted text, However, we are not actually using a new XML for the representation; instead embed the representation in XHTML (or HTML) because XHTML and XHTML processing already define many aspects of OCR output representation that would otherwise need additional, separate and ad-hoc definitions.

The XML can also be converted to HTML using XSLT. In fact, there is a project which plans to do just that.

Also, this project (hocr-tools) may be of help.

Finally, note that the Tesseract FAQ mentions this:

With the configfile 'hocr' tesseract will produce xhtml output compliant with hocr specification

Gamophyllous answered 24/6, 2015 at 15:48 Comment(2)

Thanks jcoppens for your answer. In fact, the document and the tool do not provide what I am looking for. They define the hOCR standard and format without mentioning how to present the output as an HTML table. The tool is useful for some tasks, but again it does not provide a facility to produce the output I need. Thanks again. – Sulfate 24/6, 2015 at 16:0

XHTML not appropriate? As described in the FAQ? Also, the spec contains a reference to XSL. XSLT is a tool for converting XML and can be used to create HTML (I added the reference to the answer above). – Gamophyllous 24/6, 2015 at 16:3

Here is an idea for converting an hOCR file into a table with some existing tools (although it might be too late for the original question):

1. Take the hOCR file together with the image file and create a PDF with hocr-pdf from the hocr-tools repo (see https://github.com/tmbdev/hocr-tools#hocr-pdf).
2. Use Tabula (https://github.com/tabulapdf/tabula) to extract the table data from the PDF.
3. Convert the CSV data to an HTML table (there should be plenty of tools for this task).

The first step is needed only because Tabula works only with PDFs. The second step, extracting table data from visual information, is IMO the main challenge; it might be worth checking the details of how Tabula does it if you want ideas about algorithmic approaches.
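The last step, converting CSV to an HTML table, can be done with the Python standard library alone. The following is a minimal sketch; the function name `csv_to_html_table` and the example CSV content are made up for illustration and are not actual Tabula output.

```python
import csv
import io
from html import escape


def csv_to_html_table(csv_text):
    """Render CSV text as a plain HTML table, one <tr> per CSV row."""
    out = ["<table>"]
    for row in csv.reader(io.StringIO(csv_text)):
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        out.append(f"  <tr>{cells}</tr>")
    out.append("</table>")
    return "\n".join(out)


# Hypothetical CSV resembling what a table extractor might produce
example = "CEO will invest,$30k\ninvestor,$120k\n"
print(csv_to_html_table(example))
```

Cell text is passed through `html.escape` so that characters such as `<` or `&` in the OCR output do not break the markup.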

Lives answered 6/3, 2017 at 18:18 Comment(0)
