Split a PDF by Bookmarks?
R

6

15

I need to process PDFs that were each created by merging multiple PDFs. In each merged PDF, a bookmark marks the place where each original part starts.

Is there any way to automatically split this up by bookmarks with a script?

We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.

Resurrect answered 8/4, 2010 at 16:54 Comment(0)
S
3

There are programs such as pdf-split that can do this for you:

A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split - by pages, by bookmarks and by odd/even page. You can even extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.

A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.

A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.

Edit: I also found a free, open-source program (pdfsam, linked in the comments below) if you do not want to pay.

Semitone answered 8/4, 2010 at 17:0 Comment(2)
Any Linux programs that are similar to A-PDF Split? - Resurrect
@Resurrect linux.softpedia.com/get/Printing/Pdfsam-40703.shtml is a link to pdfsam, but you can also go to its main page (the second link in my post). It is supposed to be compatible with Linux. - Semitone
A
22

pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.

To get the page numbers of the bookmarks do

pdftk in.pdf dump_data

and make your script read the page numbers from the output.
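As a rough sketch of that parsing step, the Python below reads the bookmark records out of the dump_data text. It assumes each bookmark is reported as BookmarkTitle, BookmarkLevel and BookmarkPageNumber lines, with the page number last, which is the layout the scripts on this page rely on. The function names parse_bookmarks and dump_data are made up for illustration and are not part of pdftk.

```python
import subprocess


def parse_bookmarks(dump_text):
    """Return a list of (title, level, page) tuples from pdftk dump_data text."""
    bookmarks = []
    title, level = None, None
    for line in dump_text.splitlines():
        # Lines look like "Key: value"; lines without ": " yield an empty value.
        key, _, value = line.partition(": ")
        if key == "BookmarkTitle":
            title = value
        elif key == "BookmarkLevel":
            level = int(value)
        elif key == "BookmarkPageNumber":
            # Assumption: the page number is the last field of each record.
            bookmarks.append((title, level, int(value)))
            title, level = None, None
    return bookmarks


def dump_data(pdf_path):
    """Run `pdftk <pdf> dump_data` and return its output as text."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling parse_bookmarks(dump_data("in.pdf")) would then give the page number of every bookmark, along with its title and nesting level in case you only want to split on some of them.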

Then use

pdftk in.pdf cat A-B output out_A-B.pdf

to get the pages from A to B into out_A-B.pdf.

The script could be something like this:

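#!/bin/bash
# Split a PDF into one file per bookmarked section, using pdftk.

infile=$1 # input pdf
outputprefix=$2 # prefix for the output file names

[ -e "$infile" ] && [ -n "$outputprefix" ] || exit 1 # Invalid args

# Start pages of all bookmarks, sorted numerically and de-duplicated
# (sort must run before de-duplication), followed by an "end" sentinel.
pagenumbers=( $(pdftk "$infile" dump_data |
                grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | sort -nu)
              end )

for ((i=0; i < ${#pagenumbers[@]} - 1; ++i)); do
  a=${pagenumbers[i]} # start page number
  b=${pagenumbers[i+1]} # start page of the next section, or "end"
  [ "$b" = "end" ] || b=$((b-1)) # last page of this section
  pdftk "$infile" cat "$a-$b" output "${outputprefix}_$a-$b.pdf"
done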
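The range arithmetic in the loop can also be checked in isolation. The Python sketch below turns a list of bookmark start pages into the A-B ranges that are passed to pdftk cat. The helper names page_ranges and pdftk_commands are hypothetical, not part of pdftk.

```python
def page_ranges(start_pages):
    """Turn bookmark start pages into (start, end) pairs; the last end is 'end'."""
    starts = sorted(set(start_pages))  # numeric sort plus de-duplication
    ranges = []
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            end = starts[i + 1] - 1  # one page before the next section starts
        else:
            end = "end"  # pdftk keyword for the last page of the document
        ranges.append((start, end))
    return ranges


def pdftk_commands(infile, prefix, start_pages):
    """Build one `pdftk ... cat A-B output ...` argument list per range."""
    return [
        ["pdftk", infile, "cat", f"{a}-{b}", "output", f"{prefix}_{a}-{b}.pdf"]
        for a, b in page_ranges(start_pages)
    ]
```

Each argument list could be handed to subprocess.run(cmd, check=True). As in the bash script, any pages before the first bookmark are not written to any output file.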
Applicant answered 10/4, 2012 at 9:20 Comment(2)
Nice :) I'm using grep -A1 '^BookmarkLevel: 1' | grep '^BookmarkPageNumber: ' to obtain only top-level bookmarks. Unfortunately all lower-level bookmarks get lost this way though... - Scaphoid
I just wanted to mention that this bash script still works fine on macOS Sierra with pdftk. Nicely done! - Och
G
4

There's a command-line tool written in Java called Sejda, whose splitbybookmarks command does exactly what you asked. Because it's Java, it runs on Linux, and because it's a command-line tool, you can call it from a script.

Disclaimer: I'm one of the authors.

Grill answered 18/12, 2012 at 23:47 Comment(3)
They have a limit of 200 pages. - Camber
No, there isn't any limit. Please open an issue if you are facing a problem. - Grill
sejda-console requires Pro, which is 2000$/year. Certainly not an option for my use case. - Pires
W
1

Here's a little Perl program I use for the task. Perl isn't special here; the program is just a wrapper around pdftk that interprets its dump_data output and turns it into page ranges to extract:

#!perl
use v5.24;
use warnings;

use Data::Dumper;
use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);

my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';

die "Can't find $ARGV[0]\n" unless -e $file;

# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data'
    or die "Can't run $pdftk: $!\n";

my @chapters;
while( <$pdftk_fh> ) {
    state $chapter = 0;
    next unless /\ABookmark/;

    if( /\ABookmarkBegin/ ) {
        my( $title ) = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
        my( $level ) = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;

        my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;

        # I only want to split on chapters, so I skip higher
        # level numbers (higher means more nesting, 1 is lowest).
        next unless $level == 1;

        # If you have front matter (preface, etc) then this numbering
        # will be off. Chapter 1 might be called Chapter 3.
        push @chapters, {
            title         => $title,
            start_page    => $page_number,
            chapter       => $chapter++,
            };
        }
    }

# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
    my $last_page = $chapters[$i+1]->{start_page} - 1;
    $chapters[$i]->{last_page} = $last_page;
    }
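# Without this check, an empty list makes the next line fail with a cryptic error.
die "No level-1 bookmarks found in $file\n" unless @chapters;
$chapters[$#chapters]->{last_page} = 'end';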

make_path $split_dir;
foreach my $chapter ( @chapters ) {
    my( $start, $end ) = $chapter->@{qw(start_page last_page)};

    # slugify the title so it can be used as a filename
    my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );

    my $path = catfile( $split_dir, "$title.pdf" );
    say "Outputting $path";

    # Use pdftk to extract that part of the PDF
    system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
    }
Weaks answered 14/2, 2020 at 2:20 Comment(0)
R
0

I wrote a Python script to split a PDF in two at a bookmark with a specific name, using pdftk. This script preserves the bookmarks in the two output PDFs.
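A minimal sketch of that idea, not the script itself: look up the named bookmark in the pdftk dump_data output, then build two pdftk cat commands that cut the file just before that page. This sketch does not preserve bookmarks in the outputs, and the names find_bookmark_page and split_commands are hypothetical.

```python
def find_bookmark_page(dump_text, wanted_title):
    """Return the page number of the first bookmark titled `wanted_title`."""
    current_title = None
    for line in dump_text.splitlines():
        key, _, value = line.partition(": ")
        if key == "BookmarkTitle":
            current_title = value
        elif key == "BookmarkPageNumber" and current_title == wanted_title:
            return int(value)
    raise ValueError(f"no bookmark titled {wanted_title!r}")


def split_commands(infile, page, first_out, second_out):
    """Build the two pdftk commands that cut `infile` just before `page`."""
    if page < 2:
        raise ValueError("bookmark is on page 1; the first part would be empty")
    return [
        ["pdftk", infile, "cat", f"1-{page - 1}", "output", first_out],
        ["pdftk", infile, "cat", f"{page}-end", "output", second_out],
    ]
```

The dump_text argument is the text printed by pdftk in.pdf dump_data, and each returned argument list could be run with subprocess.run(cmd, check=True).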

Radon answered 19/6, 2023 at 21:57 Comment(1)
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. (From Review) - Celeski
C
0

You can use pdf_extbook to extract bookmarked sections from PDFs on Linux.

It's libre software.

It uses pdftk to read the bookmarks from the file, fzf to allow the user to select which bookmark to extract, and pdftk again to extract bookmarked pages.

Cytology answered 13/3, 2024 at 1:6 Comment(0)
