parse tables from a PDF document

Asked 24/3, 2014 at 21:40 Answered 25/11, 2021 at 2:43

Solved python parsing pdf pdfbox apache-tika

The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this:

[image: example table from the PDF]

I'd like to programmatically extract the data and the structure from these tables.

Things I've tried: converting the PDF to HTML using

Tika: Unfortunately, the tables are converted to space-delimited paragraphs, and some of the strings themselves contain spaces, so it's not possible to split them reliably.
Python's PDFMiner: It raised an assertion error due to missing fonts. I suspect the HTML would have been similar to Tika's output, though I'll need to resolve the missing-fonts issue to confirm this.
Online tools: I tried http://www.zamzar.com/ and a couple of others. Either the file was too big for the service to process or the conversion generated errors.

I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.

The output could be JSON (e.g. one object per table), XML, or pretty much any format that maintains the structure.

Dunaj answered 24/3, 2014 at 21:40 Comment(2)

Hi Alex, did you find a solution for this problem? Would love it if you could share some thoughts on this as I am currently facing the same problem. Did you use PDFBox? Can you share the method? – Briar 31/3, 2014 at 9:41

Martijn, It's not a trivial problem to solve because pdfs do not render tabular data as such in the source. It's likely that you'll need a custom solution that's specific to the PDF that you're trying to scrape. – Chinaware 4/4, 2014 at 0:50

You could try PDFBox. The documentation for that is here:

https://pdfbox.apache.org/1.8/cookbook/textextraction.html

Extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use their coordinates to determine the column and row boundaries. You can then set up text regions to determine which characters are drawn in which region. Since you know that the layout of the regions is tabular, you can define the tables and work out which column and row each piece of extracted text belongs to using simple algorithms.
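The row/column logic itself is independent of PDFBox. Below is a minimal Python sketch of it, assuming you have already collected the stroked line segments as (x0, y0, x1, y1) tuples and the text fragments as (x, y, text) tuples, with y growing downward. The function names, the tuple layouts and the tolerance value are illustrative assumptions, not part of any library.

```python
from bisect import bisect_right


def _dedupe(values, tol):
    """Sort values and drop those within tol of the previously kept one."""
    out = []
    for v in sorted(values):
        if not out or v - out[-1] > tol:
            out.append(v)
    return out


def grid_boundaries(segments, tol=1.0):
    """Return (xs, ys): column boundaries from vertical rules, row boundaries from horizontal rules."""
    xs, ys = [], []
    for x0, y0, x1, y1 in segments:
        if abs(x0 - x1) <= tol:      # vertical segment: a column boundary
            xs.append((x0 + x1) / 2)
        elif abs(y0 - y1) <= tol:    # horizontal segment: a row boundary
            ys.append((y0 + y1) / 2)
    return _dedupe(xs, tol), _dedupe(ys, tol)


def assign_cells(fragments, xs, ys):
    """Place (x, y, text) fragments into a rows x cols grid of strings."""
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    # Process in reading order (assumes y grows downward) so text in a cell stays ordered.
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        col = bisect_right(xs, x) - 1
        row = bisect_right(ys, y) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid
```

Fragments that fall outside the outermost rules are silently dropped here; a real implementation would probably want to log them or start a new table.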

Chinaware answered 24/3, 2014 at 22:9 Comment(6)

Link is now dead. Also, can't you only interact with the PDFBox libraries using Java? As far as I see there's no way to use the API with Python. – Scruffy 6/1, 2015 at 17:18

I've fixed the link. Thank you for pointing that out. Yes, you'd have to do this with Java. I don't know of a way to do this with Python. It's my belief that PDFBox is the best tool for the job (unfortunately) and nobody has suggested anything better. – Chinaware 10/4, 2015 at 22:54

@Chinaware is there any working example of this problem? – Brinn 31/3, 2017 at 13:26

@GouravSaklecha I haven't looked at this category of problem in some time. I wouldn't think that this answer is still viable. There must be better approaches now. I'd look for a different toolset. – Chinaware 31/3, 2017 at 21:38

Thanks Jarederaj. Can you please suggest another toolset? I have explained my problem at this link: #43138981 – Brinn 1/4, 2017 at 6:18

Please also see https://mcmap.net/q/850030/-pdf-table-extraction for an example of automating this approach for a table. – Goldagoldarina 14/10, 2017 at 22:36

@alex-woolford: In general, perfect extraction of data (with or without the same formatting that you see in the PDF) is not always possible, though partial extraction usually is. I'm saying this based on having worked on a project similar to yours. I came across the same kinds of issues, and some research on the Net showed that PDF is not a perfectly reversible format, i.e. it is not always possible to recover the text and layout from a PDF with 100% accuracy. Sometimes characters even get lost or transposed during extraction, depending on the library. This seems to be due to the very nature of the PDF format and specification. It is not a text-based format. It is a derivative of PostScript and has some unusual rules about how data is laid out. This is according to official PDF documents and to the sites of well-known companies that have worked with PDF for a long time.

If less than perfect accuracy is tolerable, there are some products available (though I don't know of any for Python, as of now). One is xpdf and another is PDFTextStream. I've used the former, not the latter. xpdf is a C library and also has command-line tools. PDFTextStream is a Java tool/library. It was a paid product earlier, but last I checked, it is now free for single-threaded applications, IIRC.

Even though xpdf is for C and PDFTextStream is for Java, you could call them from Python via XML-RPC or some other distributed computing / cross-language communication approach such as sockets. Some work would be involved in setting that up, of course.
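Since xpdf ships command-line tools, a simpler route than XML-RPC or sockets may be to run one as a subprocess. The sketch below is an illustration under assumptions, not something I have run against the PDF in the question: it assumes the pdftotext binary is installed and on the PATH, and the helper names and the two-space column threshold are made up for the example. The -layout option asks pdftotext to preserve the physical layout, which tends to leave runs of spaces between table columns.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext with -layout and return its output (assumes pdftotext is on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(line, min_gap=2):
    """Split a layout-preserved line wherever at least min_gap consecutive spaces occur."""
    return re.split(r" {%d,}" % min_gap, line.strip())
```

Splitting on runs of spaces keeps values such as "500G 7200" together, but it will still fail when two adjacent columns are separated by only a single space, so the result is approximate.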

HTH.

Kettering answered 24/3, 2014 at 22:15 Comment(0)

Only FYI, as mine is not a publicly available tool: it sure is possible. Here is this one table in plain text form -- the spaces in between are tabs, not spaces:

2469-2TU    i5-3320M    4GBx1   14.0" HD    720p    500G 7200   Intel 620528    WWAN upg    Express 54  Finger  BT  6   Win7 Pro64  10/12
✂ 2469-2SU  i5-3210M    4GBx1   14.0" HD    720p    500G 7200   Intel 2200  WWAN upg    Express 54  None    None    6   Win7 Pro64  10/12
✂ 2469-2RU  i3-3110M    4GBx1   14.0" HD    720p    320G 7200   Intel 2200  WWAN upg    Express 54  None    None    6   Win7 Pro64  10/12
2469-32U    i5-3230M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13
2469-2ZU    i5-3230M    4GBx1   14.0" HD    720p    320G 7200   Intel 2200  WWAN upg    None    None    None    6   Win7 Pro64  02/13
2469-2YU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13
2469-2XU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    None    None    6   Win7 Pro64  02/13
2469-2WU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   WLAN upg    WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13

I second PDFBox, as it works similarly to my own hand-written utility: interrogate the (x,y) positions of the text, sort them, then paste together "likely" strings and insert a tab wherever the horizontal gap is larger than one would reasonably expect within a string.
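My utility isn't public, but the idea can be sketched in a few lines of Python. This is a rough illustration rather than the actual tool: it assumes the text fragments are already available as (x, y, width, text) tuples with y growing downward, and the function name and all three thresholds are arbitrary values chosen for the example.

```python
def rows_from_fragments(fragments, y_tol=2.0, gap=10.0, word_gap=1.0):
    """Rebuild tab-separated lines from (x, y, width, text) fragments."""
    # Group fragments into lines: those whose y is within y_tol of the line's first fragment.
    lines = []  # each entry is [line_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(frag[1] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])

    out = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])           # left to right
        text = frags[0][3]
        prev_end = frags[0][0] + frags[0][2]
        for x, _y, width, s in frags[1:]:
            space = x - prev_end
            if space > gap:                      # unusually wide gap: new column
                text += "\t" + s
            elif space > word_gap:               # ordinary gap: same cell, new word
                text += " " + s
            else:                                # touching fragments: same word
                text += s
            prev_end = x + width
        out.append(text)
    return out
```

The thresholds need tuning per document, since a "reasonable" gap depends on the font size and on how tightly the table is set.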

I even got the little Scissors in Zapf Dingbats :)

Manila answered 24/3, 2014 at 22:52 Comment(0)

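Parse tables from a PDF document using pdfplumber:

import pdfplumber
import pandas as pd

filepath = r"actualFile_path"  # path to the source PDF
outfile = r"destination_path"  # path to the output CSV

# Infer cell boundaries from text alignment rather than from ruling lines.
table_settings = {"vertical_strategy": "text", "horizontal_strategy": "text"}

with pdfplumber.open(filepath) as pdf:
    for page in pdf.pages:
        table = page.extract_table(table_settings=table_settings)
        if not table:
            continue  # no table detected on this page
        # Assumption: the first extracted row holds the column headers.
        df = pd.DataFrame(table[1:], columns=table[0])
        # Append this page's table to the same CSV file.
        df.to_csv(outfile, mode="a", index=False)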

Astereognosis answered 25/11, 2021 at 2:43 Comment(0)
