How to write shell script for finding number of pages in PDF?
T

13

50

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?

Tiebout answered 5/2, 2013 at 9:39 Comment(3)
Only using builtin shell commands? Or do you "allow" external tools such as pdftk or pdfinfo?Liquefy
I'm OK with any means, but I need the page number in a variable (shell script) so that I can pass it as a parameter to another function.Tiebout
This question could be useful: (#36655978)Abba
L
83

Without any extra package:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
    | sort -rn | head -n 1
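
Why take the largest value? In an uncompressed PDF, every page-tree node (/Type /Pages) carries a /Count entry giving the number of pages beneath it, and the root node holds the total, so the maximum is usually the page count. A minimal sketch on synthetic input; the helper name largest_count and the sample lines are made up for illustration and stand in for the output of strings:

```shell
# Hypothetical helper: read PDF-like text on stdin and print the largest /Count value.
largest_count() {
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Synthetic stand-in for `strings < file.pdf`: a root page-tree node (12 pages)
# with two intermediate nodes (5 and 7 pages).
printf '%s\n' \
    '<< /Type /Pages /Count 12 /Kids [2 0 R 3 0 R] >>' \
    '<< /Type /Pages /Count 5 /Parent 1 0 R >>' \
    '<< /Type /Pages /Count 7 /Parent 1 0 R >>' \
    | largest_count
```

This prints 12. Other objects, such as outline (bookmark) entries, also use /Count, and objects stored in compressed streams are invisible to strings, so the result can be wrong for some files. Real usage would be: strings < file.pdf | largest_count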

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
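
Since the goal is to get the page count into a variable, here is a minimal sketch built on the pdfinfo command above. The function names pages_from_pdfinfo and pdf_page_count are hypothetical, and the sketch assumes pdfinfo prints a single line starting with "Pages:":

```shell
# Hypothetical helper: extract the page count from pdfinfo-style output on stdin.
pages_from_pdfinfo() {
    awk '/^Pages:/ {print $2; exit}'
}

# Hypothetical wrapper: print the page count of the PDF given as $1.
# Returns non-zero if pdfinfo produced no numeric page count.
pdf_page_count() {
    n=$(pdfinfo "$1" 2>/dev/null | pages_from_pdfinfo)
    case $n in
        ''|*[!0-9]*) return 1 ;;   # empty or not purely digits
    esac
    printf '%s\n' "$n"
}
```

Usage: num_pages=$(pdf_page_count file.pdf) || echo "could not read page count" >&2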

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
    awk '/^Pages:/ {n += $2} END {print n}'
Liquefy answered 6/2, 2013 at 18:53 Comment(10)
I found that the shell-only method is not always reliable. I have PDF files with only one page that contain several /Count entries with different numbers. I suggest using one of the other two methods.Rialto
@Rialto thanks for the info! Is it possible that you share at least one of these PDFs?Liquefy
On Linux, pdfinfo (v0.12.4) does not print the correct number of pages: it says 12,052 while Adobe says 20,131. The first method, however, does report the same number as Adobe.Lentigo
@ShipluMokaddim It is super hacky, but you don't need any additional packagesLiquefy
It's important to point out that the PDF count of pages may be affected by its inner objects compression. However, when it's not the case, the number of pages could be present after '.*/N' or '.*/Pages'. It's not trivial to find out which tag holds the correct value. But, the shell solution works well and is a great alternative to pdf's trailer dictionary search using pdfinfoGrammatical
You can get the number of pages without the need of awk by using the \K operator of grep. The command to execute would be pdfinfo file.pdf | grep -Po 'Pages:[[:space:]]+\K[[:digit:]]+'.Puss
Here's another one with pdftoppm, which comes pre-installed on Ubuntu: https://mcmap.net/q/347278/-how-to-write-shell-script-for-finding-number-of-pages-in-pdf.Gumm
How does your strings solution work, by the way? Can you please explain it? I don't have any idea what is really contained in a PDF binary.Gumm
I tried strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' for two or three files and it works fine. Why do we pipe it to sort -rn | head -n 1?Lynnelynnea
A variant on your last option: ls *.pdf | xargs -I%s sh -c "pdfinfo %s | awk '/^Pages:/ {n += \$2} END {print n, \"%s\"}'"Subgroup
T
9

The ImageMagick suite provides a tool called identify which, combined with counting its lines of output, gets you what you are after. ImageMagick is an easy install on OS X with brew.

Here is a functional bash script that captures it to a shell variable and dumps it back to the screen...

#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
# identify prints one line per page; discard stderr, count the lines of output,
# and trim the whitespace from the wc -l output
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
echo "The number of pages is: $numberOfPages"

And the output of running it...

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$ 
Thiazole answered 6/2, 2013 at 14:16 Comment(1)
BTW: You should use $() instead of backticks `` see BashFAQ/082Liquefy
A
9

The pdftotext utility converts a PDF file to plain text, inserting page breaks (form-feed characters, $'\f') between the pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  '-',  the  text  is
       sent to stdout.

There are many ways to combine it with other tools to solve your problem; choose one of them:

1) pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2) pdftotext + awk (v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3) pdftotext + awk (v2):

$ pdftotext file.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4) pdftotext + awk (v3):

$ pdftotext file.pdf - | awk -v RS="\f" 'END{ print NR }'
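
The variants above do not necessarily agree. One possible reason, offered as a guess rather than a diagnosis: options 1 and 2 count lines that contain a form feed, so two page breaks on the same line (for example around an empty page) are counted once, whereas options 3 and 4 count records separated by form feeds. A sketch that counts the form-feed bytes themselves, using a hypothetical helper count_form_feeds and synthetic input standing in for pdftotext output:

```shell
# Hypothetical helper: count form-feed bytes on stdin, independent of line layout.
count_form_feeds() {
    tr -cd '\f' | wc -c | tr -d ' '
}

# Synthetic stand-in for pdftotext output: three page breaks, two on the same line.
printf 'page one\n\f\fpage three\n\f' | count_form_feeds          # counts bytes
printf 'page one\n\f\fpage three\n\f' | grep -c "$(printf '\f')"  # counts lines
```

On this input the byte count is 3 while grep -c reports 2. With a real file: pdftotext file.pdf - | count_form_feeds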

Hope it Helps!

Abba answered 22/4, 2016 at 18:45 Comment(1)
WATCH OUT! These different lines might give back different numbers! 1 and 2 gave me 264 on a file, but 3 and 4 returned 286. Not sure about the exact reason.Pindling
K
9

Here is a version for the command line directly (based on pdfinfo):

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
Koroseal answered 20/1, 2019 at 17:50 Comment(3)
I love this, thank you. Here the filename is printed to the right of the number of pages: for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; doneWendellwendi
This is what I was looking for. Thanks.Hypogeal
This one counts recursively in each folder and prints the file name and number of pages: find "${1:-.}" -name "*.pdf" | while IFS= read -r x; do pdfinfo "$x" | grep Pages | awk '{printf $2 }'; echo " $x"; doneHypogeal
G
5

Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

How does this work? If you specify a first page that is larger than the number of pages in the PDF (I specify page 1000000, which should exceed the page count of any realistic PDF), pdftoppm prints the following error to stderr:

Wrong page range given: the first page (1000000) can not be after the last page (142).

So, I redirect that stderr message to stdout with 2>&1, pipe it to grep to match the (142). part with the regular expression '([0-9]*)\.$', and then pipe that to grep again with the regular expression '[0-9]*' to extract just the number, which is 142 in this case. That's it!
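
The parsing step can be tried on the error message alone, without a PDF. The sketch below uses a single sed substitution in place of the two grep stages; the helper name last_page_from_error is hypothetical, and it assumes the message keeps the wording quoted above, which may differ between Poppler versions:

```shell
# Hypothetical helper: extract the last-page number from pdftoppm's range error on stdin.
last_page_from_error() {
    # Capture the digits inside the final "(...)." at the end of the line.
    sed -n 's/.*(\([0-9][0-9]*\))\.$/\1/p'
}

# The sample message quoted above stands in for real pdftoppm output.
msg='Wrong page range given: the first page (1000000) can not be after the last page (142).'
printf '%s\n' "$msg" | last_page_from_error
```

This prints 142. With a real file: pdftoppm mypdf.pdf -f 1000000 2>&1 | last_page_from_error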

Wrapper functions and speed testing

Here are a couple of wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: https://mcmap.net/q/347278/-how-to-write-shell-script-for-finding-number-of-pages-in-pdf
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: https://mcmap.net/q/347278/-how-to-write-shell-script-for-finding-number-of-pages-in-pdf
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi

    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"

    echo "$_num_pgs"
}

Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast--the same as the pdftoppm one.

See also

  1. These awesome answers by Ocaso Protal.
  2. These functions above will be used in my pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.
Gumm answered 6/4, 2021 at 5:49 Comment(1)
This is light years faster than pdftk, important if you are calling this on a lot of PDFs on a dynamic web page. This is the best solution, IMO.Hydrazine
D
3

mupdf/mutool solution:

mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2
Dineen answered 10/11, 2020 at 3:41 Comment(0)
S
2

Just dug out an old script (in ksh) I found:

#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
#       pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'

[[ "$#" != "1" ]] && {
   printf "ERROR: No file specified\n"
   exit 1
}

numpages=0
while read line; do
   num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
   (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages
Steiermark answered 6/5, 2015 at 13:55 Comment(0)
S
2

If you're on macOS you can query pdf metadata like this:

mdls -name kMDItemNumberOfPages -raw file.pdf

as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal

Supercool answered 6/3, 2020 at 6:14 Comment(0)
P
2

Another mutool solution making better use of the options:

mutool show file.pdf Root/Pages/Count

Poet answered 18/1, 2023 at 9:56 Comment(0)
E
1

I made a small improvement to Marius Hofert's tip to sum the returned values.

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'
Edvard answered 17/2, 2020 at 17:53 Comment(2)
Not the downvoter, but I suspect that the reason your answer is receiving negative attention is that it would have been better left as a comment on the answer you reference.Mccammon
Yes, I know. The problem is I am new here, and stackoverflow only allows to comment with 50 reputation score. I still don't have that.Edvard
H
1

QPDF offers the most straightforward method I'm aware of.

qpdf --show-npages input.pdf
Hyperboloid answered 13/7, 2024 at 5:55 Comment(0)
W
0

To build on Marius Hofert's answer, this command uses a bash for loop to show the number of pages and the filename for each PDF, ignoring the case of the file extension.

for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done
Wendellwendi answered 3/2, 2021 at 23:41 Comment(1)
I think using both grep and awk in a pipeline is overkill; it's better to use awk alone, which reduces the pipe count by one. Also, use shopt -s nocaseglob to ignore the file extension's case instead of entering every capital letter manually.Besse
H
0

A super quick but effective alternative is the great exiftool program.

exiftool -FileName -PageCount -T file.pdf

For example, with file.pdf having 5 pages, the output will be:

file.pdf    5

Extra bonus:
Create a text file listing every PDF file in the current directory with its page count:

exiftool -FileName -PageCount -T -ext pdf . > report.txt

Add the -r flag to scan subfolders recursively:

exiftool -FileName -PageCount -T -r -ext pdf . > report.txt
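
A report written this way can be totalled afterwards. The sketch below assumes the two-column, tab-separated layout shown above (file name, then page count) and uses a hypothetical helper name, total_from_report. Rows whose second column is not a number are skipped, on the assumption that exiftool may print a placeholder such as - when it finds no page count:

```shell
# Hypothetical helper: sum the page-count column of a tab-separated report on stdin.
total_from_report() {
    awk -F '\t' '$2 ~ /^[0-9]+$/ {s += $2} END {print s + 0}'
}

# Synthetic stand-in for report.txt: two readable PDFs and one row without a count.
printf 'a.pdf\t5\nb c.pdf\t7\nbroken.pdf\t-\n' | total_from_report
```

This prints 12. With a real report: total_from_report < report.txt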
Hardecanute answered 13/7, 2024 at 12:31 Comment(0)