Extract / Identify Tables from PDF python [closed]

3

52

Are there any open source libraries that support table identification & extraction?

By this I mean:

1. Identify that a table structure exists
2. Classify the table from its contents
3. Extract data from the table in a useful output format, e.g. JSON / CSV etc.

I have looked through similar questions on this topic and found the following:

PDFMiner, which addresses problem 3, but it seems the user is required to tell PDFMiner where each table is located (correct me if I'm wrong).
pdf-table-extract, which attempts to address problem 1 but, according to its to-do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all tables in my PDFs are separated by whitespace!

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!

Pericline answered 16/2, 2015 at 0:4 Comment(4)

If you can use tools beyond python, too, you might want to take a look at tabula. – Shinbone 16/2, 2015 at 7:14

Thanks, I will definitely look into that. I'm keen on finding a solution in Python, though, because of how quickly Python code can be written. – Pericline 16/2, 2015 at 22:9

@Alexander McFarlane: Try SLICEmyPDF in 1 of the answers at #56018202 – Poll 29/5, 2022 at 10:20

See https://mcmap.net/q/340221/-how-can-i-extract-tables-as-structured-data-from-pdf-documents/562769 – Leggat 11/2, 2023 at 10:32

41

You should definitely have a look at this answer of mine:

Extracting table contents from a collection of PDF files

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.

Lithic answered 17/2, 2015 at 1:0 Comment(6)

just an update on the effectiveness of this answer... I hacked a solution together using tabula last year to iterate through about 100 PDFs that had a few formats in common. It wasn't pretty but it was the best of the worst and saved significant time. – Pericline 22/4, 2016 at 21:34

Will pypi.python.org/pypi/pdftable satisfy the requirements? – Wadmal 20/9, 2017 at 9:56

It only works on text-based PDFs and not on images. Is there anything similar that can extract data from PDF images? – Acherman 30/11, 2018 at 6:6

@Sundeep: Of course it can only work on text-based PDFs. If you want to extract tables from an image, you have to attempt running a process of OCR (optical character recognition) on the image first and then apply the table extraction on the text. Final result quality will largely depend on success of the OCR step. There is nothing which would be able to extract tables (or texts) directly from image-only PDFs. – Lithic 30/11, 2018 at 8:57

I am looking for tools that can do that. By the way, thanks for the info, @KurtPfeifle. – Acherman 30/11, 2018 at 9:7

@Sundeep: You could start by looking at which tools are mentioned here: stackoverflow.com/questions/tagged/ocr – Lithic 30/11, 2018 at 14:23

49

After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!

I hope you are using Linux;

pdftotext -layout NAME_OF_PDF.pdf

AMAZING!!

Now you have a nice text file with all the information lined up in neat columns, and it is trivial to convert it to CSV or a similar format.

It is for times like this that I love Linux, these guys came up with AMAZING solutions to everything, and put it there for FREE!
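As a rough illustration of the "format into a csv" step, here is a minimal Python sketch. It assumes `pdftotext` is installed and on the PATH, and that columns in the layout-preserved text are separated by at least two spaces, which will not hold for every PDF. The function names are hypothetical and chosen only for this example.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` on pdf_path and return the text.

    Passing "-" as the output file asks pdftotext to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(line):
    """Split one line on runs of two or more whitespace characters.

    Assumption: a single space only ever occurs inside a cell.
    """
    return re.split(r"\s{2,}", line.strip())


def layout_text_to_rows(text):
    """Turn layout-preserved text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]
```

The resulting rows could then be passed to `csv.writer(...).writerows(rows)`. Cells that themselves contain double spaces, or columns separated by a single space, would need a different splitting rule.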

Trinary answered 20/8, 2017 at 22:20 Comment(4)

I was able to get pdftotext on Windows 10. Just download XPDFTools for Windows from xpdfreader.com/download.html – Schrecklichkeit 16/2, 2018 at 11:49

@Schrecklichkeit just got back to it today... basically I'm taking the text file generated by pdftotext, splitting it into a list of strings (one for each line), creating a set of field names using csv.DictWriter(), and then looping through each line, slicing it into the fields I want and then feeding those back to DictWriter. HTH. gist.github.com/memilanuk/c6e0bb9f98076a172d4f39d044ed6ecf – Berton 19/2, 2018 at 21:5

This library is written for Python 2.x and doesn't work with Python 3.x. – Chipper 26/2, 2019 at 16:47

Also possible on Mac superuser.com/questions/56272/… – Marital 7/7 at 14:19
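The fixed-width slicing approach described in the comments above (split the pdftotext output into lines, slice each line into fields, feed the fields to `csv.DictWriter`) might look roughly like this. It is only a sketch, not the code from the linked gist: the field names and character offsets in `FIELDS` are made up for illustration and would have to be measured from your own pdftotext output.

```python
import csv
import io

# Hypothetical layout: field name -> (start, end) character offsets in a line.
FIELDS = {"name": (0, 10), "qty": (10, 16), "price": (16, 24)}


def slice_line(line, fields=FIELDS):
    """Cut one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, (start, end) in fields.items()}


def lines_to_csv(lines, fields=FIELDS):
    """Write the sliced lines as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields))
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines between table rows
            writer.writerow(slice_line(line, fields))
    return buffer.getvalue()
```

Fixed offsets are more robust than splitting on whitespace when cells may be empty, but they break as soon as the column positions differ between pages or documents.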

18

I'd just like to add to the very helpful answer from Kurt Pfeifle - there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF table to a pandas DataFrame. You can also set the area in x,y coordinates, which is very handy for irregular data.
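A minimal sketch of how this might be used, assuming tabula-py and a Java runtime are installed. As far as I understand the tabula-py API, `tabula.read_pdf` accepts `pages`, `multiple_tables`, `area` (as top, left, bottom, right, in points) and `guess`. The helper names below are hypothetical, and the coordinates in the usage note are placeholders.

```python
def build_read_pdf_options(pages="all", area=None):
    """Collect keyword arguments for tabula.read_pdf.

    area, if given, is (top, left, bottom, right) in points.
    """
    options = {"pages": pages, "multiple_tables": True}
    if area is not None:
        options["area"] = list(area)
        # Assumption: turn off automatic region guessing so that the
        # explicit area is the region that gets analysed.
        options["guess"] = False
    return options


def extract_tables(pdf_path, pages="all", area=None):
    """Return the tables tabula finds in pdf_path as pandas DataFrames."""
    import tabula  # imported lazily; provided by the tabula-py package

    return tabula.read_pdf(pdf_path, **build_read_pdf_options(pages, area))
```

For example, `extract_tables("report.pdf", pages=1, area=(100, 50, 400, 550))` would restrict extraction to one region of the first page. Each returned DataFrame can then be exported with `to_csv()` or `to_json()`.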

Servility answered 22/4, 2017 at 10:38 Comment(0)
