Programmatically Extract PDF Tables

I have a bunch of PDF docs with tabular data in them which I need to extract into a more readable format to store in a spreadsheet, database or whatever.

Is there anything out in the world (preferably free) that can extract tabular data from PDFs into a more readable format in bulk, whether natively integrated with an app, run from the command line, or looped in code (.NET)?

Can be any format really (doc, html) just as long as the tables are maintained.

Anything I've found so far is either a one-off (only does one doc at a time, I have hundreds, that isn't happening) or does not maintain the table structure.

If you have any ideas, please post them.

Subterrane answered 6/8, 2010 at 14:12 Comment(4)
It would help if you could expand this question with specific examples of the source PDF, as this is required to answer the question with any precision. - Bandit
@Thilo - you attached a bounty to this question, and it's not clear that @Subterrane is paying any attention. Do you have some sample data to point to that you would like addressed? - Bandit
@Bandit This is related to #3930293 (I get the text data from pdftotext). - Pullulate
Possible duplicate of "How to read table from PDF using itextsharp?" - Sin

This is a giant hassle. In general, extracting the text content of a PDF file is running against the grain of what PDF wants you to do.

Start by trying to get the text out. This may be more or less successful, depending on how the PDF is built. One place to start is GhostScript or pstotext. If that fails you, this guy has a list of text extraction tools. Once you have the text stream, you could then try to reassemble the tabular structure programmatically.
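
As a rough illustration of that last step, here is a minimal Python sketch. It assumes the text was extracted with column alignment preserved (for example with `pdftotext -layout`), so that cells on a row are separated by runs of two or more spaces. The names `split_row`, `text_to_rows` and `rows_to_csv` are made up for this example, and the two-space rule is an assumption that will break on tables whose cells contain wide gaps or whose columns nearly touch.

```python
import csv
import io
import re

# Assumption: columns are separated by at least two consecutive spaces.
COLUMN_GAP = re.compile(r" {2,}")


def split_row(line):
    """Split one layout-preserved text line into cell strings."""
    return COLUMN_GAP.split(line.strip())


def text_to_rows(text):
    """Turn extracted text into a list of rows, skipping blank lines."""
    return [split_row(line) for line in text.splitlines() if line.strip()]


def rows_to_csv(rows):
    """Serialize rows as CSV text so a spreadsheet can open them."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample of layout-preserved text, not taken from a real PDF.
sample = (
    "Item        Qty    Unit price\n"
    "Widget A     12          3.50\n"
    "\n"
    "Widget B      7         10.00\n"
)
print(rows_to_csv(text_to_rows(sample)))
```

Real documents usually need extra rules on top of this (header detection, wrapped cells, page footers), which is why the result has to be tailored to the source data.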

Finally, if you are in seriously bad shape, and if the PDFs don't cooperate, you could do the OCR thing. The right long-term solution is to get the data into the right format at the outset, either by doing a single, massive, painful, and probably partially manual conversion, or by going to the source and asking that the data be provided in a more usable form.

If you can give a more specific PDF example file, there may be a better or more precise answer. There is NO general solution to this; if extraction is possible at all, it will need to be tailored to your specific source data.

Note this rather pointed response to the general question. It doesn't help with the fact that you have the problem in front of you, but maybe it would provide useful top cover when explaining to your boss why there isn't an obvious answer? ;-)

A new SO question popped up, and referred to this library -- iTextSharp -- which looks possibly related. SO question: Best way to extract...

Bandit answered 15/10, 2010 at 16:22 Comment(0)
  1. For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:

  2. For an amazing family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages), contradicting point '1.' above see these links:

Paterson answered 29/9, 2014 at 23:35 Comment(0)

Check out IvyTools IvyPdf: www.ivytools.net. It can extract tables as well as any other data. If your documents are well structured, it's very easy to set up, but it can deal with pretty complicated scenarios too. It's free for personal use.

Buchholz answered 10/4, 2018 at 22:17 Comment(0)

Given your requirement, the straightforward answer is that this is not really possible in the general case. Unlike Word or Excel, the PDF specification does not define a table object for page content. The tables you see in those PDF documents are just a series of rectangles drawn so that they look like a table, and the details depend on the PDF writer that created the file: some draw the table structure as a series of lines instead.

You could write your own parser based on the PDF file specification, but that is a daunting task: expect it to take several months to get a parser that works with a reasonable number of PDF documents.

In case you do decide to write your own parser, the article below should give you a jump start: Code Project Article

Slung answered 17/10, 2010 at 18:51 Comment(7)
There are a bunch of PDF toolsets out there... I don't know how this helps answer the question. - Bandit
@andersoj, thanks for your feedback. I've been developing a commercial PDF solution for the past 2 years, and several of our customers have asked this same question. I gave my straightforward response based on that experience with the PDF file format. As far as I know, no such component is available on the market. There are some commercial solutions that export PDF to a Word document, and I know how far they are reliable ;) Cheers. - Slung
Ah, that's similar to the LaTeX to Word approach? Generate one bitmap for each page, place it on the page, and your Word document is ready? - Phytoplankton
@Karthik -- I removed my downvote. As a PDF guru, you know that the question isn't answerable in its current form. Suppose these tables were encoded as embedded images? Suppose they used a non-standard font or font encoding? Given that PDF has little in the way of semantics, and the haphazard ways PDF output has been structured by various producers, these problems are rife... We need sample data to answer the question. - Bandit
@Stephan, no, those tools don't use a bitmap-based approach. Instead, they parse the PDF file, extract the text and its positions in a first pass, and then create a new Word document based on the text XY positions retrieved from the PDF. This approach works fine for some documents (where you get output similar to the PDF), but there is no guarantee that it will work reliably with all PDF documents. - Slung
@andersoj, if the tables are encoded as embedded images, then we could extract the images from the PDF file with some small tweaks to the iTextSharp library code. But most PDF producers don't typically do this, because if you encode a table as an image, the text within the table will not be selectable or searchable in a PDF reader (for example, Adobe Reader). I am not quite sure what you meant by sample data. If @Subterrane could share a couple of the PDF files he wants to extract tables from, then I could share some further details. Thank you so much ;) - Slung
@Karthik -- re: sample data, that's exactly what I meant. Beyond pointing the questioner to some toolkits, vaguely, we'd need a sample PDF to see if any of them would really apply. Agreed that most contemporary PDF producers wouldn't embed images, but if the questioner was working with a contemporary producer, he could probably get the data in a much more suitable form than PDF! Several times I've wanted to do this to extract hundreds of pages of protocol spec from a recent (~2002 era) MILSPEC document, only to find that I had to OCR the whole thing because it was all images. - Bandit

The PDF format is built as a collection of letters, which have no inherent structure. You can think of a PDF as a page that has come through OCR, and you take it from there: the letters and their coordinates are there, and the rest is up to you, namely figuring out the layout, formats, columns, and eventually tables.

Unmoor answered 17/10, 2010 at 18:56 Comment(0)

When you say

Anything I've found so far ... only does one doc at a time

I'll assume you mean "is a GUI app, without a programming interface."

In this case you could use Microsoft UI Automation to programmatically control the app and make it do what you want.

UIA ... provides a means for exposing and collecting information about user interface elements and controls to support user interface accessibility and software test automation ... and is compatible with both Win32 and the .NET Framework.

Kelcy answered 15/10, 2010 at 0:37 Comment(0)

If all the data is text data, you can always use iTextSharp. It's free and you only need the "itextsharp.dll".

http://sourceforge.net/projects/itextsharp/

Here is a simple function for reading the text out of a PDF.

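Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New PdfReader(PdfFileName)
    Dim sOut As New StringBuilder()

    Try
        ' Page numbers are 1-based.
        For i As Integer = 1 To oReader.NumberOfPages
            ' Use a fresh strategy per page so text does not carry over between pages.
            Dim its As New SimpleTextExtractionStrategy()

            ' AppendLine keeps the last line of one page apart from the first line of the next.
            sOut.AppendLine(PdfTextExtractor.GetTextFromPage(oReader, i, its))
        Next
    Finally
        ' Release the file handle even if extraction fails.
        oReader.Close()
    End Try

    Return sOut.ToString()
End Function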

That will at least get you the text to start with.

Stereobate answered 1/9, 2011 at 18:27 Comment(1)
It is not free for commercial use. - Plasterboard

I've tried extracting the plain text from PDFs using tools like pdf2text, but too much of the table, formatting, and layout information is lost to accurately reconstruct the original.

It might be more successful to use a PDF API to extract the x,y positions of the text boxes and lines, and use that information to reconstruct the table.
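
To make the idea concrete, here is a minimal Python sketch of the reconstruction step only. It assumes some PDF library has already produced one `(x, y, text)` entry per text box, with `x` the left edge and `y` the baseline, and that columns are left-aligned. The `Box` type, the function names, and both tolerance values are hypothetical and would need tuning per document. Ruling lines and multi-line cells are ignored.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    x: float   # left edge of the text box
    y: float   # baseline of the text box
    text: str


def group_rows(boxes: List[Box], y_tol: float = 2.0) -> List[List[Box]]:
    """Cluster boxes whose baselines lie within y_tol of each other."""
    rows: List[List[Box]] = []
    # Assumption: y grows upward, so the largest y is the top of the page.
    for box in sorted(boxes, key=lambda b: -b.y):
        if rows and abs(rows[-1][0].y - box.y) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b.x) for row in rows]


def column_starts(rows: List[List[Box]], x_tol: float = 5.0) -> List[float]:
    """Collect distinct left edges; each one is treated as a column start."""
    starts: List[float] = []
    for x in sorted(b.x for row in rows for b in row):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def build_table(boxes: List[Box], y_tol: float = 2.0, x_tol: float = 5.0) -> List[List[str]]:
    """Place every box in the grid cell given by its row and nearest column."""
    rows = group_rows(boxes, y_tol)
    starts = column_starts(rows, x_tol)
    table = []
    for row in rows:
        cells = [""] * len(starts)
        for box in row:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - box.x))
            cells[col] = (cells[col] + " " + box.text).strip()
        table.append(cells)
    return table


# Hypothetical coordinates; the last row has an empty second cell.
boxes = [
    Box(50, 700, "Name"), Box(200, 700, "Qty"),
    Box(50, 685.5, "Bolt"), Box(201, 686, "40"),
    Box(50, 670, "Nut"),
]
for line in build_table(boxes):
    print(line)
```

Unlike splitting plain text on whitespace, this keeps empty cells in the right column, which is the main reason to work from coordinates.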

There seem to be several third party tools and APIs that try this approach:

The paid version of Solid Framework seems to be able to extract tables from PDF to Excel and CSV automatically and fairly well from the PDFs I've thrown at it.

The free PDF Mechanic seems to be a small GUI program wrapped around Solid Framework, which you can use to try out their PDF extraction technique.

There's also the free tool pdf2table which you might be able to call from your program, but I haven't tried it yet.

Mcleroy answered 11/10, 2011 at 9:14 Comment(0)

I recently ran into this problem.

An alternative solution I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, this preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel spreadsheets.

The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.
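
For anyone trying the same route, the XML post-processing might look roughly like the Python sketch below. The element names `Table`, `TR`, `TH` and `TD` are an assumption about what the export contains and should be checked against your own files; `tables_from_xml` and `to_csv` are names made up for this example. Nested tables are not handled.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tables_from_xml(xml_text):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("Table"):          # assumed tag name
        rows = []
        for tr in table.iter("TR"):           # assumed tag name
            cells = []
            for cell in tr:
                if cell.tag in ("TH", "TD"):  # assumed tag names
                    # itertext() also picks up text inside nested elements.
                    cells.append(" ".join("".join(cell.itertext()).split()))
            rows.append(cells)
        tables.append(rows)
    return tables


def to_csv(rows):
    """Serialize one table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical export fragment, not real tool output.
sample_xml = """
<Document>
  <Table>
    <TR><TH>Part</TH><TH>Price</TH></TR>
    <TR><TD><P>Bolt</P></TD><TD>0.25</TD></TR>
  </Table>
</Document>
"""
for rows in tables_from_xml(sample_xml):
    print(to_csv(rows))
```

Because the merged export holds the tables from every source file in one document, looping over `tables_from_xml` is what replaces the per-file export.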

Wiredraw answered 13/5, 2015 at 15:45 Comment(0)
