PDF parsing in C++ (PoDoFo)

Hi, I'm trying to parse text from some PDFs and would like to use PoDoFo. I have searched for examples of how to parse a PDF with PoDoFo, but all I can find are examples of how to create and write a PDF file, which is not what I need.

If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or can suggest a different library, please let me know. I know there is pdftotext on Linux; however, not only can I not use it, I would much rather do everything internally and not rely on outside programs being installed.

Rafaellle answered 30/7, 2012 at 4:53 Comment(2)
Have you looked at SumatraPDF? It's open source (though GPL - see repo). - Bankston
I have not, but from the looks of it, it only works on Windows, which isn't what I'm looking for. Thanks though! - Rafaellle
45

PoDoFo does not provide a means to easily extract text from a document, but it is not hard to do.

Load a document into a PdfMemDocument:

PoDoFo::PdfMemDocument pdf("mydoc.pdf");

Iterate over each page:

for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PoDoFo::PdfPage* page = pdf.GetPage(pn);

Iterate over all the PDF commands on that page:

    PoDoFo::PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PoDoFo::PdfVariant var;
    PoDoFo::EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        switch (type) {
            case PoDoFo::ePdfContentsType_Keyword:
                // process token: it contains the current command
                //   pop from var stack as necessary
                break;
            case PoDoFo::ePdfContentsType_Variant:
                // process var: push it onto a stack
                break;
            default:
                // should not happen!
                break;
        }
    }
}

The "process token" and "process var" comments are where it gets a little more complex. You are given raw PDF commands to process. Luckily, if you're not actually rendering the page and all you want is the text, you can ignore most of them. The commands you need to process are:

BT, ET, Td, TD, Ts, T*, Tm, Tf, ", ', Tj and TJ

The BT and ET commands mark the beginning and end of a text object, so you want to ignore anything that's not between a BT/ET pair.

The PDF language is RPN based. A command stream consists of values which are pushed onto a stack and commands which pop values off the stack and process them.
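To make the stack model concrete, here is a minimal sketch that does not use PoDoFo at all. The Token and Command structs and the operandCount and run functions are hypothetical names invented for this illustration. In the loop above, the equivalent would be pushing each PdfVariant in the ePdfContentsType_Variant case and popping in the ePdfContentsType_Keyword case.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one item of a content stream:
// either an operand (a value) or an operator (a command).
struct Token {
    bool isOperator;
    std::string text;
};

// One executed command together with the operands it consumed.
struct Command {
    std::string op;
    std::vector<std::string> args;
};

// Operand counts for the text operators discussed here.
// Operators not listed are treated as consuming the whole stack.
std::size_t operandCount(const std::string& op, std::size_t stackSize) {
    static const std::map<std::string, std::size_t> counts = {
        {"BT", 0}, {"ET", 0}, {"T*", 0}, {"Td", 2}, {"TD", 2},
        {"Tf", 2}, {"Tm", 6}, {"Ts", 1}, {"Tj", 1}, {"TJ", 1},
        {"'", 1},  {"\"", 3}};
    auto it = counts.find(op);
    return it == counts.end() ? stackSize : it->second;
}

// Push operands; when an operator arrives, pop what it needs.
std::vector<Command> run(const std::vector<Token>& tokens) {
    std::vector<std::string> stack;
    std::vector<Command> out;
    for (const Token& t : tokens) {
        if (!t.isOperator) {
            stack.push_back(t.text);
            continue;
        }
        std::size_t n = operandCount(t.text, stack.size());
        if (n > stack.size()) n = stack.size();  // tolerate malformed input
        Command c{t.text, {}};
        c.args.assign(stack.end() - static_cast<std::ptrdiff_t>(n), stack.end());
        stack.clear();  // this sketch discards any leftover operands
        out.push_back(c);
    }
    return out;
}
```

Feeding it the tokens for "/F1 12 Tf 72 700 Td (Hello) Tj" yields three commands, each paired with exactly the operands that preceded it.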

The ", ', Tj and TJ commands are the only ones which actually generate text. ' and Tj take a single string operand; " takes two spacing numbers followed by a string. Use var.IsString() and var.GetString() to process the string.

TJ takes an array of strings, interleaved with numbers that adjust spacing. You can extract each string with:

if (var.IsArray()) {
    PoDoFo::PdfArray& a = var.GetArray();
    for (size_t i = 0; i < a.GetSize(); ++i) {
        if (a[i].IsString()) {
            // do something with a[i].GetString()
        }
    }
}
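
The numbers in a TJ array are position adjustments in thousandths of a text space unit, and a large negative value widens the gap before the next string. The sketch below shows one way to turn that into spaces. It is independent of PoDoFo: TjElement and joinTj are hypothetical names, and the threshold of 200 is an assumption that would need tuning per font.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one element of a TJ array.
struct TjElement {
    bool isString;
    std::string text;  // used when isString is true
    double adjust;     // used when isString is false
};

// Concatenate the strings of a TJ array, inserting a space wherever the
// adjustment is more negative than -gapThreshold (that is, a wide gap).
std::string joinTj(const std::vector<TjElement>& elems,
                   double gapThreshold = 200.0) {
    std::string out;
    for (const TjElement& e : elems) {
        if (e.isString) {
            out += e.text;
        } else if (e.adjust < -gapThreshold && !out.empty() &&
                   out.back() != ' ') {
            out += ' ';
        }
    }
    return out;
}
```

Small adjustments (kerning) are ignored, so fragments such as (ev)(en) are joined into one word, while a wide gap becomes a word break.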

The other commands are used to determine when to introduce a line break. " and ' also introduce line breaks. Your best bet is to download the PDF spec from Adobe and look up the text processing section. It explains what each command does in more detail.
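
As a rough illustration of the line-break decision, here is a heuristic sketch that is independent of PoDoFo. The function name startsNewLine is hypothetical, and treating any non-zero vertical offset as a new line is an assumption. Tm is not handled: it sets the text matrix outright, so a fuller version would have to compare the new vertical position with the previous one.

```cpp
#include <string>
#include <vector>

// Rough heuristic: does this operator move the text position to a new line?
// `args` holds the operator's numeric operands in order; it is only
// consulted for Td and TD.
bool startsNewLine(const std::string& op, const std::vector<double>& args) {
    // T*, ' and " always move to the start of the next line.
    if (op == "T*" || op == "'" || op == "\"") return true;
    // Td and TD take (tx, ty); a non-zero ty is a vertical move.
    if ((op == "Td" || op == "TD") && args.size() == 2) return args[1] != 0.0;
    // Everything else, including Tm, is treated as staying on the line.
    return false;
}
```

A caller would emit a newline before the next piece of text whenever this returns true.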

I found it very helpful to write a small program which takes a PDF file and dumps out the command stream for each page.

Note: If all you're doing is extracting raw text with no positioning information, you don't actually need to maintain a stack of var values. Every text-showing command takes its string (or array) as its final operand, so you can simply assume that the last value read into var holds the text for the current command.
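
A minimal sketch of that shortcut, again without PoDoFo: the Item struct is a hypothetical stand-in for what ReadNext hands back, and extractText is an invented name. It remembers only the most recent operand, ignores everything outside a BT/ET pair, and leaves TJ arrays out for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one item read from a content stream.
struct Item {
    bool isKeyword;    // true for a command, false for an operand
    std::string text;  // keyword name, or the operand's value
    bool isString;     // true if an operand is a PDF string
};

// Collect text from Tj, ' and " using only the last operand seen.
std::string extractText(const std::vector<Item>& items) {
    std::string out;
    const Item* last = nullptr;  // most recent operand, if any
    bool inText = false;         // true between BT and ET
    for (const Item& it : items) {
        if (!it.isKeyword) {
            last = &it;
            continue;
        }
        if (it.text == "BT") {
            inText = true;
        } else if (it.text == "ET") {
            inText = false;
        } else if (inText) {
            const bool breaks = (it.text == "'" || it.text == "\"");
            if (breaks) out += '\n';  // ' and " start a new line first
            if ((breaks || it.text == "Tj") && last && last->isString)
                out += last->text;
        }
        last = nullptr;  // operands do not carry over to the next command
    }
    return out;
}
```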

Emptor answered 30/7, 2012 at 10:15 Comment(6)
Thank you for a well-written explanation! I will accept your answer after trying this out in the next few days, in case I need clarification. - Rafaellle
@Emptor, quite a detailed and helpful answer. Can you also add some info on whether images have operators too, and how one can identify them? - Multinational
@codin - I haven't worked with images in PDF files, so I can't give you a detailed answer. I would suggest looking up images in the PDF specs (Section 8.9: Images). - Emptor
@Ferruccio, thanks. I find that Do operators are for images. But it's getting difficult to remove artwork other than image objects, like the background behind text. - Multinational
When I am working with text, I am getting most of my words broken up into multiple objects within an array, without any regard for whitespace. E.g.: (e)(v)(en)(inef)(f)(ectiv)(e)(until)(w)(e)(unders)(t)(ood)(some)(of)(its)(im)(por)(tant)(dif)(f)(erences). This is within a single variant that has a TJ tag. Is there somewhere else that whitespace information is contained for something with a TJ tag? - Emissary
@user1362215 - The array you get with a TJ tag should contain numbers and strings. The numbers represent spacing information. See PDF Reference 1.7, section 5.3 for detailed information. - Emptor
4

I haven't used PoDoFo, but a quick browse through the class hierarchy on their API webpage reveals:

void PoDoFo::PdfMemDocument::Load( const char * pszFilename )

(API doc link)

So I would just hazard a guess here, that you do:

PoDoFo::PdfMemDocument doc;
doc.Load( "somefile.pdf" );

Then I imagine you navigate the document tree by calling doc.GetObjects() and walking through that array (see the PdfDocument class).

Dejection answered 30/7, 2012 at 5:49 Comment(0)
