Extract TOC of PDF?

I am extracting a PDF into images / SWF and text with the help of SWFTools and XPDF. I am running these in a PDF script.

Now I am trying to go one step further and get the TOC from the PDF. Is it possible to extract this information?

Childers answered 12/3, 2010 at 8:50 Comment(0)
14

I found this with a little bit of searching. It looks rather promising.

PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html

Note: The tool is Python based, but you should be able to use the tool via shell access. Alternatively, you may be able to glean some useful info from the source code itself, as the project is open source.

From the Site:

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).

Examples:

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
Arose answered 24/3, 2010 at 23:13 Comment(6)
Upon further investigation, I could find some really useful applications for this tool myself! +1 to Yusuke Shinyama and the rest of the PDFMiner team! - Arose
Thanks, I will have a look. But does it generate the TOC inside the XML too? As of now I am using XPDF and PDF2SWF to get the content already :) But there is no option for the TOC. - Childers
I guess I'm not sure what you're asking. The second "example" line claims to dump specifically the TOC to an XML file, which you can parse in whatever manner suits you. I haven't used the tool myself, it just sounds like it would accomplish what you're wanting to do. - Arose
Ahhh, thanks, sorry I missed it ^^ I will give that a shot, or mupdf. - Childers
dumppdf -T file.pdf | grep \<outline gives a nice readable table of contents. (dumppdf -T file.pdf | grep -E '\<outline|pageno' also gives the page numbers.) - Zakarias
This solution always throws a PDFNoOutlines exception for me. Is there any fix for this? I tried several PDFs, but the error persists. - Willy
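Regarding the PDFNoOutlines exception: it is raised when the PDF has no outline (bookmark) tree at all, so there is nothing for -T to dump. Below is a minimal sketch of reading the outline from Python instead of the command line. It assumes the pdfminer.six fork is installed (import paths differ in the original pdfminer release); the function names format_outline and read_toc are made up for this example.

```python
from typing import Iterable, List, Tuple


def format_outline(entries: Iterable[Tuple]) -> List[str]:
    """Turn outline tuples into indented titles (level 1 = no indent).

    Each entry is expected to start with (level, title, ...), which is
    the shape yielded by pdfminer.six's PDFDocument.get_outlines().
    """
    lines = []
    for level, title, *_rest in entries:
        lines.append("  " * (level - 1) + str(title))
    return lines


def read_toc(path: str) -> List[str]:
    """Return the indented TOC of a PDF, or [] if it has no outline."""
    # Imported lazily so format_outline works without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # get_outlines() is a generator, so the exception surfaces
            # while iterating; keep the iteration inside the try block.
            return format_outline(document.get_outlines())
        except PDFNoOutlines:
            return []
```

Usage would be along the lines of print("\n".join(read_toc("foo.pdf"))). An empty result means the file has no bookmarks, in which case no outline-based tool will help.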
16

I tried dumppdf.py -T, but it did not work on some PDF files.

There is another tool from MuPDF named mutool, which I just found. I don't know if it is better than dumppdf.py, but it worked on a PDF file for which dumppdf.py throws an error.

Here's how to extract the TOC with mutool:

mutool show {your-pdf-file} outline

MuPDF

Caneghem answered 6/5, 2016 at 13:19 Comment(1)
Great method (easier to visualize than dumppdf), and it also displays the page number at the end, with the anchor position in X and Y coordinates (format "#PAGE,X,Y", where X and Y are distances from the top left corner, in UserUnit, which by default equals 1/72 inch = 2.54/72 cm, but can be changed). Example for something on page 19: + "The name of the section" #19,135,421 - Monkhmer
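If you want to consume that output from a script, here is a rough sketch of a parser. It assumes every outline line looks like the example in the comment above (optional marker, quoted title, then "#PAGE,X,Y"); other mutool versions may print a different format, so treat the regular expression as an assumption. The names OutlineEntry, parse_outline_line, user_units_to_cm and run_mutool_outline are invented for this example.

```python
import re
import subprocess
from typing import List, NamedTuple, Optional

# Assumed line shape, taken from the example in the comment above:
#   + "The name of the section" #19,135,421
OUTLINE_LINE = re.compile(
    r'^\s*[+|-]?\s*"(?P<title>.*)"\s*'
    r'#(?P<page>\d+),(?P<x>-?\d+(?:\.\d+)?),(?P<y>-?\d+(?:\.\d+)?)\s*$'
)


class OutlineEntry(NamedTuple):
    title: str
    page: int
    x: float
    y: float


def parse_outline_line(line: str) -> Optional[OutlineEntry]:
    """Parse one outline line; return None if it does not match."""
    match = OUTLINE_LINE.match(line)
    if match is None:
        return None
    return OutlineEntry(
        title=match.group("title"),
        page=int(match.group("page")),
        x=float(match.group("x")),
        y=float(match.group("y")),
    )


def user_units_to_cm(value: float, user_unit: float = 1.0) -> float:
    """Convert a coordinate to cm; one default UserUnit is 1/72 inch."""
    return value * user_unit / 72 * 2.54


def run_mutool_outline(pdf_path: str) -> List[OutlineEntry]:
    """Run `mutool show <pdf> outline` and parse the lines that match."""
    result = subprocess.run(
        ["mutool", "show", pdf_path, "outline"],
        capture_output=True, text=True, check=True,
    )
    parsed = (parse_outline_line(line) for line in result.stdout.splitlines())
    return [entry for entry in parsed if entry is not None]
```

Lines that do not fit the assumed shape are skipped rather than raising, so a format change shows up as an empty list instead of a crash.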
3

Alternatively, you can use MuPDF, which is a pretty lightweight but complete PDF implementation written in C. In the apps/ subdirectory you will find some tools which can view, dump and extract information from PDF files. I'd prefer MuPDF over xpdf because it is actively maintained and has better PDF support.

Otherwise, there's always Poppler, which is based on xpdf. The developers ported its code to C++. Hence, it performs worse than its predecessor. Compared to MuPDF, Poppler seems to have slightly more features, but in return the code is much more complex.

For your purposes MuPDF should be sufficient, though. You could hack together a simple application from the example code provided in apps/ that extracts all the information you need without relying on external applications.

Disario answered 31/3, 2010 at 1:58 Comment(0)
0

I think looking at PHP's PDFLib would be a very good place to start. If you scroll down, you will see plenty of user-posted solutions for converting PDF to HTML or PDF to Text. After conversion, a relatively simple match function could extract the tagged TOC items and throw them into an array for example, which you can then manipulate as you please.

This StackOverflow post also has some more solutions.

Hope this helps.

Woodbridge answered 24/3, 2010 at 6:38 Comment(2)
I am using XPDF pdf2txt already, but how would you match this? The TOC is normally created by hand, and the info needs to be somewhere in the PDF (since PDFs can have the side panel). - Childers
The TOC should only be created by hand when people don't have the required professional tools to do that automatically. If done automatically, the items in the TOC get tagged as bookmarks (and I think this is what you're referring to as the "side panel") and linked to their pages, and are thus easier to match. If they are done by hand, then they are no different from any other chunk of text anywhere in that PDF, and having a script successfully match them would be close to impossible. - Woodbridge
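For the hand-made case, about the best a script can do on the extracted text is a heuristic. The sketch below is one such guess, not a reliable method: it assumes TOC lines use dot leaders ("Title ..... 12"), which many documents do not. The function name find_toc_candidates is made up for this example.

```python
import re
from typing import List, Tuple

# Heuristic: some title text, at least three leader dots, then a page number.
TOC_LINE = re.compile(r"^(?P<title>\S.*?)\s*\.{3,}\s*(?P<page>\d+)\s*$")


def find_toc_candidates(text: str) -> List[Tuple[str, int]]:
    """Return (title, page) pairs for lines that look like TOC entries."""
    candidates = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            candidates.append((match.group("title"), int(match.group("page"))))
    return candidates
```

Ordinary body text that happens to end in dots and a number will also match, which is exactly the ambiguity described above; bookmarks, when present, are the more dependable source.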
