Extracting text from PDF with Poppler (C++)

Asked 28/4, 2010 at 18:31 Answered 4/11, 2013 at 9:36

I'm trying to get my way through Poppler and its (lack of) documentation.

What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't really matter here.

I found the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not very handy. Isn't there a simple function that outputs the PDF text in order (maybe line by line)?

Floreneflorentia asked 28/4, 2010 at 18:31 Comment(1)

The poppler source code includes two simple example programs in ./cpp/tests that illustrate all functionality. – Graziano 25/2, 2016 at 11:11

You should be able to set the selection rectangle to the pageSize/MediaBox of the page and get all the text.

I say "should" because, before you get surprised by the output of poppler_page_get_text, you should be aware of how text gets laid out on a page. All graphics are placed on a page by a program expressed in postfix notation. To render the page, this program is executed on a blank page.

Operations in the program can include changing colors, the position, or the current transformation matrix, drawing lines and Bezier curves, and so on. Text is laid out by a series of text operators that are always bracketed by BT (begin text) and ET (end text). How or where text is placed on a page is at the sole discretion of the software that generates the PDF. For example, a print driver responds to GDI calls for DrawString and translates them into text drawing operations.
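To make the postfix idea concrete, here is a minimal sketch that walks a simplified content stream and collects the strings shown by Tj between BT and ET. It assumes an uncompressed stream that uses only literal strings in parentheses; real streams are usually compressed and also use hex strings, TJ arrays and font-specific encodings, which this ignores. The function name shown_strings is made up for illustration and is not part of Poppler.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of Poppler): returns the string operands of
// Tj operators that appear between BT and ET in an uncompressed content stream.
std::vector<std::string> shown_strings(const std::string& stream) {
    std::vector<std::string> shown;
    std::string operand;        // most recent string operand
    bool have_operand = false;  // true if the previous token was a string
    bool in_text = false;       // true between BT and ET
    std::size_t i = 0;

    while (i < stream.size()) {
        const unsigned char c = static_cast<unsigned char>(stream[i]);
        if (std::isspace(c)) { ++i; continue; }

        if (c == '(') {
            // Literal string: parentheses nest, and a backslash makes the next
            // character literal (octal and newline escapes are not decoded).
            operand.clear();
            int depth = 1;
            ++i;
            while (i < stream.size() && depth > 0) {
                const char d = stream[i++];
                if (d == '\\' && i < stream.size()) operand += stream[i++];
                else if (d == '(') { ++depth; operand += d; }
                else if (d == ')') { if (--depth > 0) operand += d; }
                else operand += d;
            }
            have_operand = true;
            continue;
        }

        // Any other token (name, number, operator) runs to whitespace or '('.
        std::string token;
        while (i < stream.size() && stream[i] != '(' &&
               !std::isspace(static_cast<unsigned char>(stream[i]))) {
            token += stream[i++];
        }
        // Postfix notation: the operand comes first, the operator after it.
        if (token == "BT") in_text = true;
        else if (token == "ET") in_text = false;
        else if (token == "Tj" && in_text && have_operand) shown.push_back(operand);
        have_operand = false;
    }
    return shown;
}
```

The strings come back in the order the generator emitted them, which is not necessarily the order a human would read them in.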

If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. Then you might have ligatures, where multiple characters are replaced by a single glyph; the most common ones are ae, oe, fi, fl, and ffl.
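If the extracted text does come back as Unicode, ligatures can show up as single presentation-form code points. A minimal sketch of undoing that on UTF-8 text, assuming the extractor emitted the standard f-ligature code points U+FB00 to U+FB04 (a font with a private encoding may not, and ae/oe are left alone here because they are ordinary letters in some languages). expand_ligatures is a hypothetical helper, not a Poppler function.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Poppler): replaces the Unicode f-ligatures
// U+FB00..U+FB04 in UTF-8 text with their component letters.
std::string expand_ligatures(const std::string& utf8) {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"\xEF\xAC\x80", "ff"},   // U+FB00
        {"\xEF\xAC\x81", "fi"},   // U+FB01
        {"\xEF\xAC\x82", "fl"},   // U+FB02
        {"\xEF\xAC\x83", "ffi"},  // U+FB03
        {"\xEF\xAC\x84", "ffl"},  // U+FB04
    };

    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        bool matched = false;
        for (const auto& entry : table) {
            // Compare the bytes at position i against one ligature encoding.
            if (utf8.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += utf8[i++];  // copy every other byte unchanged
    }
    return out;
}
```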

With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor quality results from text extraction.
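For the reading-order problem, one naive heuristic is to sort the text fragments by position after extraction. The sketch below assumes you already have fragments with the coordinates of their baseline start (the Fragment struct and to_lines function are invented for this example), that the page is a single column, and that y grows upward as in PDF user space. It is not how Poppler or Acrobat do it, and it will fail on multi-column or rotated text.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical record for one piece of text and where it starts on the page.
struct Fragment {
    double x;
    double y;
    std::string text;
};

// Groups fragments whose y differs from the first fragment of a line by at
// most `tolerance`, orders each group left to right, and joins it with spaces.
std::vector<std::string> to_lines(std::vector<Fragment> frags, double tolerance = 2.0) {
    // Top of the page first: larger y means higher on the page.
    std::stable_sort(frags.begin(), frags.end(),
                     [](const Fragment& a, const Fragment& b) { return a.y > b.y; });

    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < frags.size()) {
        // Find the end of the current line.
        std::size_t end = start + 1;
        while (end < frags.size() &&
               std::fabs(frags[end].y - frags[start].y) <= tolerance) {
            ++end;
        }
        // Within a line, order fragments left to right.
        std::sort(frags.begin() + start, frags.begin() + end,
                  [](const Fragment& a, const Fragment& b) { return a.x < b.x; });

        std::string line;
        for (std::size_t i = start; i < end; ++i) {
            if (!line.empty()) line += ' ';
            line += frags[i].text;
        }
        lines.push_back(line);
        start = end;
    }
    return lines;
}
```

The tolerance is a guess that depends on font size; a value that works for body text may merge or split lines elsewhere.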

I used to work on the text extraction tools in Acrobat 1.0 and 2.0 - it's a real challenge to get right.

Grisons answered 29/4, 2010 at 19:13 Comment(1)

Thank you a lot for the explanation. I think I will start to read a bit more extensively about how the PDF is coded then. Or try to rethink my strategy a little bit... :) Cheers nico – Floreneflorentia 1/5, 2010 at 11:42

Just for the record, I am using Poppler right now with this little program:

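#include <iostream>
#include <memory>

#include "poppler-document.h"
#include "poppler-page.h"

int main()
{
    // load_from_file returns an owning raw pointer, or a null pointer on failure
    std::unique_ptr<poppler::document> doc(
        poppler::document::load_from_file("./CMI2APIDocV1.4.pdf"));
    if (!doc) {
        std::cerr << "could not open the PDF" << std::endl;
        return 1;
    }

    const int pagesNbr = doc->pages();
    std::cout << "page count: " << pagesNbr << std::endl;

    for (int i = 0; i < pagesNbr; ++i) {
        // create_page also returns an owning pointer, which may be null
        std::unique_ptr<poppler::page> page(doc->create_page(i));
        if (!page)
            continue;
        // to_latin1() cannot represent characters outside Latin-1; to_utf8() keeps them
        const poppler::byte_array utf8 = page->text().to_utf8();
        std::cout.write(utf8.data(), utf8.size()) << std::endl;
    }
}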

// g++ -I/usr/include/poppler/cpp/ -c poppler.cpp
// g++ -I/usr/include/poppler/cpp poppler.o  /usr/lib/x86_64-linux-gnu/libpoppler-cpp.a /usr/lib/x86_64-linux-gnu/libpoppler.a /usr/lib/x86_64-linux-gnu/liblcms2.so     /usr/lib/x86_64-linux-gnu/libfontconfig.a /usr/lib/x86_64-linux-gnu/libjpeg.a /usr/lib/x86_64-linux-gnu/libfreetype.a     /usr/lib/x86_64-linux-gnu/libexpat.a /usr/lib/x86_64-linux-gnu/libz.a

I am quite happy with the result so far, except for arrays and "spreadsheet"-style tables rendered as plain text, where a single cell may sometimes span multiple lines. (Does anyone know how to avoid that?)

Carlow answered 4/11, 2013 at 9:36 Comment(1)

There is a related question concerning "spreadsheet"-type data: Extracting tables from PDF files programmatically? – Wager 28/1, 2016 at 12:57
