PDF table extraction

Asked 24/4, 2012 at 15:10 Answered 27/3, 2019 at 12:41

I have the same data saved as a GIF image file and as a PDF file, and I want to parse it to HTML or XML. The data is actually the menu for my university's cafeteria, which means that a new version of the file has to be parsed each week. In general, the files contain some header and footer text, with a table full of other data in between. I have read some posts on Stack Overflow and have also made some attempts to parse out the table data as HTML/XML:

PDF

PDFBox || iText (Java)
Google Docs Import
PDF2HTML || PDF2Table

GIF

Tesseract-OCR

I got the best result from parsing the PDF file with PDFBox, but since the menu changes weekly, it is still not reliable enough. The HTML that I receive sometimes includes more and sometimes fewer "paragraphs" (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know whether there is another way to do it.

Posse answered 24/4, 2012 at 15:10 Comment(6)

PDF->text is rarely straightforward. PDF is a document layout language, not a markup language. Depending on how the pdf generator's mood is that day, it can generate entirely different documents each time. – Doralynn 24/4, 2012 at 15:12

I see. The only thing that bothers me is that some PDF-to-XLS parsers work quite well. So why aren't there any open source projects that can also parse a PDF table reliably? – Posse 24/4, 2012 at 15:36

If you can contact the people who write this menu, see what format it is produced in. They might create it in a format that is much easier to extract text from. – Westerfield 24/4, 2012 at 19:17

That was also an option I was thinking of, but there were two problems with it: 1. universities like to hide their information and only make it accessible if they want to, and 2. I was also hoping to find an approach that would be applicable to more cafeterias than just the one I meant ;) I will just continue with my "trial and error" method! – Posse 24/4, 2012 at 19:53

The sample PDF is located at goo.gl/xc8r3. @njzk2: Why should I forget OCR? – Posse 5/5, 2012 at 9:4

Possible duplicate of Parsing PDF files (especially with tables) with PDFBox – Sheridansherie 15/10, 2017 at 21:21

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.

Gerick answered 29/1, 2014 at 14:50 Comment(3)

Agreed, the accuracy that I've seen so far is outstanding (it mentions that table headers can still be problematic, but I've had no problems with them so far). I just wish there was an API... – Dysteleology 7/4, 2014 at 21:7

Oh, there is. The engine that powers Tabula is tabula-extractor, and you can get it here: github.com/jazzido/tabula-extractor - it's written with jruby, which you'll need, but the instructions are straightforward. – Dysteleology 8/4, 2014 at 19:18

An updated list of tools: okfnlabs.org/blog/2016/04/19/… – Gerick 6/5, 2016 at 18:43

I have implemented my own algorithm (its name is traprange) to parse tabular data in PDF files.

Following are some sample pdf files and results:

Input file: sample-1.pdf, result: sample-1.html
Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange

or my article at traprange
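As a rough illustration of the distance-based idea behind this kind of approach (not the project's actual code), the horizontal extents of all text chunks can be projected onto the x-axis and overlapping ranges merged; each merged range is then treated as a column. The chunk tuples and the `merge_ranges` and `column_index` helpers below are hypothetical names invented for this sketch, and the coordinates are made-up sample data.

```python
from typing import List, Tuple

Range = Tuple[float, float]


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping (start, end) intervals into disjoint ranges."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def column_index(x: float, columns: List[Range]) -> int:
    """Return the index of the column range containing x, or -1 if none does."""
    for i, (start, end) in enumerate(columns):
        if start <= x <= end:
            return i
    return -1


# Hypothetical text chunks as (x_start, x_end, text), e.g. from a PDF text extractor.
chunks = [
    (10, 40, "Monday"), (100, 150, "Pasta"), (200, 230, "3.50"),
    (12, 45, "Tuesday"), (98, 160, "Schnitzel"), (205, 232, "4.20"),
]

# Every merged x-range is assumed to be one table column.
columns = merge_ranges([(x0, x1) for x0, x1, _ in chunks])
```

The same merging applied to the vertical extents of the chunks would give row ranges. Text that spans several columns (a header line, for example) would glue columns together, so such lines have to be excluded before merging.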

Remnant answered 12/4, 2015 at 10:41 Comment(1)

great work on this project! you may want to consider adding support for border lines analysis to separate rows and columns, not just by distance – Twopence 9/8, 2016 at 7:26

You can use Camelot to extract tables from your PDF and export them to an HTML file. CSV, Excel and JSON are also supported. You can check out the documentation at: http://camelot-py.readthedocs.io. It gives more accurate results compared to other open-source table extraction tools and libraries. Here's a comparison.

You can use the following code snippet to go forward with your task:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_html('file.html')

Disclaimer: I'm the author of the library.

Reunion answered 21/11, 2018 at 11:39 Comment(0)

If you are looking to extract data from tables once a week and you are on Windows, then please check this freeware PDF utility, which includes automated table detection and table-to-CSV/XML conversion: PDF Viewer utility.

The utility is free for both commercial and non-commercial usage for non-developers (and there is a separate version for developers who want to automate via API).

Disclaimer: I work for ByteScout

Twopence answered 24/2, 2015 at 12:6 Comment(5)

The software is awesome, but the price not so much, for a person where one dollar is almost 4. :( – Proteose 7/8, 2016 at 17:22

@jack pdf utility (PDF Multitool) is completely free, did you mean PDF Extractor SDK? – Twopence 7/8, 2016 at 18:36

I just tested the option to convert to HTML; this is by far the best software for that I've ever found. Did you work on this software? I want to use that extraction within my own software, so yes, I mean the SDK. – Proteose 7/8, 2016 at 22:34

@jack is there a way to PM you? – Twopence 9/8, 2016 at 7:22

sure, you can email me at jackj33 at google's mail server – Proteose 9/8, 2016 at 20:59

I have tried many OCR and text conversion programs, and I believe one should write the PDF-to-text conversion program oneself, since the image is better understood by the person performing the task.

I have also tried Google and many other online (about 900 websites) and offline (about 1000 programs) products by different companies. Whether you want to extract text via OCR or directly from a PDF, the most accurate program I found is PDFTOHTML. The accuracy rate of PDFTOHTML is about 98%, and Google's online service has about 94% accuracy. It is very good software that also preserves the formatting of the text, i.e. bold, italic, etc.

Counterstatement answered 1/5, 2012 at 18:51 Comment(2)

You're right about the text recognition ability itself. PDF2HTML provides quite a good result, but it still cannot handle tables within a PDF document; it simply does not recognize their existence. I, though, was searching for a "tool" that can also detect tables and convert them (together with the information in them) to data like HTML or XML. – Posse 1/5, 2012 at 22:34

Nobody in the world can extract OCR/image data to HTML tables or anything similar. Tables are not used for the purpose of displaying the text; if the tables have borders then it might be possible, but quite difficult. One has to deal with two things, OCR and PDF. Nothing is impossible, but it is very difficult. One first has to extract the text together with the position of every piece of text from the OCR and then place it as in the PDF. Also try PS (Ghostscript), as many printing techniques use it. Converting your GIF image to PS first and then to PDF might give the correct answer. – Counterstatement 2/5, 2012 at 3:48

For major templates, Tabula is the best open source option, while Abbyy PDF editor is a great solution for enterprise-level PDF data extraction and modification. Abbyy works on OCR.

Tabula has two options: automatic table detection, or manual selection by providing coordinates.

Maigre answered 27/3, 2019 at 12:41 Comment(1)

Although your two answers might be correct, you should post some links to encourage research ;). Also, I think the problem @Posse is having is a conceptual one. I think it'll be easier to just extract the data from PDF/PNG/GIF into plain text. With that, you can then create HTML/XML from it... and the engine will be better, since it has a narrower scope/responsibility. – Mateya 27/3, 2019 at 14:9
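A minimal sketch of that two-step idea, assuming the first step has already produced layout-preserving plain text in which columns are separated by runs of two or more spaces (for example, the kind of output `pdftotext -layout` tends to give). The `split_columns` and `text_to_html_table` helpers and the sample menu are invented for illustration.

```python
import html
import re


def split_columns(line: str) -> list:
    """Split one text line into cells wherever two or more spaces occur."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def text_to_html_table(text: str) -> str:
    """Turn column-aligned plain text into a simple HTML table."""
    rows = [split_columns(line) for line in text.splitlines() if line.strip()]
    parts = ["<table>"]
    for row in rows:
        # Escape cell text so characters like & and < stay valid HTML.
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        parts.append(f"  <tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


# Hypothetical plain-text menu, as it might come out of the extraction step.
menu_text = """
Monday    Pasta & salad    3.50
Tuesday   Schnitzel        4.20
"""

table_html = text_to_html_table(menu_text)
print(table_html)
```

Keeping the text extraction and the HTML generation separate means the second step can be tested on plain strings, independently of whichever PDF or OCR tool produced them. The two-space rule breaks down when a cell is empty or when a cell itself contains wide gaps, so real menus would probably need a position-based split instead.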

Are the tables in the same place each time? If you can find the dimensions of each box, you could use a tool to split the PDF into multiple documents, each of which contains one box, after which you can use whatever tool you want to convert each smaller PDF to HTML (such as the tools mentioned in other answers). Random Google searches pulled up PyPdf, which looked like it might have some useful functions.

If you aren't able to hard code the size of the box (or want to apply the problem to multiple menus in different formats), the obvious method to me (I said obvious, not easy) would be edge detection to find where the border of the table would be, and then apply the splitting I talked about before.
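One simple way to approximate the border-finding step, assuming the page has already been rendered and binarized into a grid of 0/1 pixels (1 = dark) and the table rulings are axis-aligned: count dark pixels per row and per column, and treat any row or column that is almost entirely dark as a ruling line. This is a projection-profile sketch, which is cruder than real edge detection; the function names and the tiny sample image are made up for illustration.

```python
from typing import List, Tuple


def line_positions(counts: List[int], length: int, min_fill: float = 0.8) -> List[int]:
    """Return the centre index of each run where counts[i] >= min_fill * length."""
    positions: List[int] = []
    run_start = None
    for i, count in enumerate(counts + [0]):  # trailing 0 closes a final run
        if count >= min_fill * length:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            positions.append((run_start + i - 1) // 2)
            run_start = None
    return positions


def find_grid_lines(image: List[List[int]], min_fill: float = 0.8) -> Tuple[List[int], List[int]]:
    """Find y positions of horizontal rulings and x positions of vertical ones."""
    height, width = len(image), len(image[0])
    row_counts = [sum(row) for row in image]
    col_counts = [sum(image[y][x] for y in range(height)) for x in range(width)]
    return (line_positions(row_counts, width, min_fill),
            line_positions(col_counts, height, min_fill))


def cell_boxes(rows: List[int], cols: List[int]) -> List[Tuple[int, int, int, int]]:
    """Build (top, left, bottom, right) boxes between neighbouring rulings."""
    return [(top, left, bottom, right)
            for top, bottom in zip(rows, rows[1:])
            for left, right in zip(cols, cols[1:])]


# Tiny hypothetical binarized image of a 2x2 table (1 = dark pixel).
image = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]

rows, cols = find_grid_lines(image)
boxes = cell_boxes(rows, cols)
```

The resulting boxes are the regions that the splitting step would then crop out one by one. Because the cell sizes are measured from the image rather than hard-coded, the same code copes with rows of varying height, but it does assume that every cell border is actually drawn and that the scan is not skewed.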

Aseity answered 3/5, 2012 at 9:41 Comment(3)

The hardcoded approach is not applicable to my situation. Since there are new menus each week with a different number of meals, the table structure varies in the size of the table cells... After reading a lot more on SO and Google, I have actually found a way to detect "data" in images: the Hough transformation. It still does not completely fit my demands – Posse 3/5, 2012 at 15:18

@Posse why doesn't the transformation completely "fit [your] demands"? – Aseity 3/5, 2012 at 17:26

Since there are different kinds of menus, I would probably need to hardcode a lot of stuff, but I want to make it more generic. So the Hough transformation would be sufficient, but not efficient enough. – Posse 5/5, 2012 at 9:3

I recently ran into a similar problem.

An alternative solution I found was to open a PDF document in Adobe and export it to XML. At least with my PDFs it preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.

The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.
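A minimal sketch of the "work with the XML" step, using only the Python standard library. The element names `Table`, `TR`, `TH` and `TD` and the sample document are assumptions made for illustration; the actual tag names depend on how the PDF was tagged and exported, so inspect the real file first.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical export; real files may use different element names.
SAMPLE_XML = """<Document>
  <P>Menu for week 17</P>
  <Table>
    <TR><TH>Day</TH><TH>Meal</TH><TH>Price</TH></TR>
    <TR><TD>Monday</TD><TD>Pasta</TD><TD>3.50</TD></TR>
    <TR><TD>Tuesday</TD><TD><P>Schnitzel</P></TD><TD>4.20</TD></TR>
  </Table>
</Document>"""


def tables_from_xml(xml_text: str) -> list:
    """Return every table in the document as a list of rows of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("Table"):
        rows = []
        for tr in table.iter("TR"):
            # itertext() flattens nested elements such as <P> inside a cell.
            rows.append(["".join(cell.itertext()).strip()
                         for cell in tr if cell.tag in ("TH", "TD")])
        tables.append(rows)
    return tables


def to_csv(rows: list) -> str:
    """Serialize rows to CSV text, which Excel and similar tools can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


tables = tables_from_xml(SAMPLE_XML)
csv_text = to_csv(tables[0])
print(csv_text)
```

With one big merged export, the same loop returns all tables in document order, so they can be written to separate CSV files by index. Nested tables are not handled separately here: rows of an inner table would also show up in the outer one.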

Caniff answered 13/5, 2015 at 15:41 Comment(0)
