I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?
Without any extra package:
strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
Using pdfinfo:
pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
Using pdftk:
pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'
You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:
find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
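On why the first method pipes through sort -rn | head -n 1: a PDF's pages are stored as a tree, and every intermediate /Pages node carries its own /Count, so one file can contain several /Count entries. The root node's value is the total, and it is normally the largest. Below is a minimal sketch on made-up text standing in for the output of strings; the helper name max_count and the sample objects are assumptions for illustration only. Two caveats: /Count also appears in outline (bookmark) dictionaries, so the largest value is not guaranteed to be the page count, and if the page tree sits inside a compressed object stream, strings will not see any /Count at all.

```shell
#!/bin/sh
# Hypothetical helper: read PDF-like text on stdin and print the largest /Count value.
max_count() {
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Made-up stand-in for `strings < file.pdf`: a root /Pages node with 5 pages
# and two child nodes holding 2 and 3 pages each.
sample='<< /Type /Pages /Count 5 /Kids [ 3 0 R 8 0 R ] >>
<< /Type /Pages /Count 2 /Parent 1 0 R >>
<< /Type /Pages /Count 3 /Parent 1 0 R >>'

printf '%s\n' "$sample" | max_count   # prints 5
```

Without sort -rn | head -n 1, the sed command alone would print 5, 2 and 3 on separate lines for this sample.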
[...] 12,052 while Adobe says 20,131. The first method, however, does report the same number as Adobe. – Lentigo
[...] awk by using the \K operator of grep. The command to execute would be pdfinfo file.pdf | grep -Po 'Pages:[[:space:]]+\K[[:digit:]]+'. – Puss
[...] pdftoppm, which comes pre-installed on Ubuntu: https://mcmap.net/q/347278/-how-to-write-shell-script-for-finding-number-of-pages-in-pdf. – Gumm
[...] strings solution work, by the way? Can you please explain it? I don't have any idea what is really contained in a PDF binary. – Gumm
[...] strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' for two-three files and it works fine. Why do we pipe it to sort -rn | head -n 1? – Lynnelynnea
ls *.pdf | xargs -I%s sh -c "pdfinfo %s | awk '/^Pages:/ {n += \$2} END {print n, \"%s\"}'" – Subgroup
The ImageMagick library provides a tool called identify which, in conjunction with counting its lines of output, gets you what you are after. ImageMagick is an easy install on macOS with brew.
Here is a functional bash script that captures it to a shell variable and dumps it back to the screen...
#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
# identify prints one line of info per page; send stderr to /dev/null,
# count the lines of output,
# and trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"
And the output of running it...
$ ./countPages.sh aSampleFile.pdf
Processing aSampleFile.pdf
The number of pages is: 2
$
[...] $() instead of backticks ``, see BashFAQ/082. – Liquefy
The pdftotext utility converts a PDF file to text format, inserting page breaks (form-feed characters, $'\f') between the pages:
NAME
       pdftotext - Portable Document Format (PDF) to text converter.
SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]
DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.
       Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is
       not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is
       sent to stdout.
There are many ways to combine it with other tools to solve your problem; choose one of them:
1) pdftotext + grep:
$ pdftotext file.pdf - | grep -c $'\f'
2) pdftotext + awk (v1):
$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'
3) pdftotext + awk (v2):
$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'
4) pdftotext + awk (v3):
$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
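One thing to watch with the grep variant: grep -c counts lines that contain a form feed, not the form feeds themselves, so two form feeds on the same line (which is what an empty page would produce, as far as I can tell) are counted once. Here is a minimal sketch that counts the form-feed bytes directly instead, assuming pdftotext emits exactly one form feed per page. The helper name count_form_feeds and the sample text are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count form-feed bytes on stdin.
count_form_feeds() {
    # Delete everything except form feeds, count what is left,
    # and strip the padding some wc implementations add.
    tr -cd '\f' | wc -c | tr -d ' '
}

# Made-up stand-in for `pdftotext file.pdf -`: page 1, an empty page 2, page 3.
printf 'one\n\f\fthree\n\f' | count_form_feeds   # prints 3
```

With a real file this would be used as pdftotext file.pdf - | count_form_feeds.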
Hope it Helps!
Here is a version for the command line directly (based on pdfinfo):
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):
# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
How does this work? Well, if you specify a first page (-f) which is larger than the number of pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:
Wrong page range given: the first page (1000000) can not be after the last page (142).
So, I redirect that stderr message to stdout with 2>&1, then I pipe that to grep to match the (142). part with the regular expression ([0-9]*)\.$, then I pipe that to grep again with the regular expression [0-9]* to find just the number, which is 142 in this case. That's it!
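To see the two grep steps in isolation, here is a small sketch that runs them on the error text itself rather than on a real PDF, so it needs no pdftoppm. The variable names msg and step1 are just for illustration.

```shell
#!/bin/sh
# The error text quoted above, copied verbatim as sample input.
msg='Wrong page range given: the first page (1000000) can not be after the last page (142).'

# Step 1: keep only the parenthesised number that ends the line.
# "(1000000)" is not followed by ".", so it does not match.
step1="$(printf '%s\n' "$msg" | grep -o '([0-9]*)\.$')"
printf '%s\n' "$step1"                      # prints (142).

# Step 2: keep only the digits.
printf '%s\n' "$step1" | grep -o '[0-9]*'   # prints 142
```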
Wrapper functions and speed testing
Here are a couple of wrapper functions to test these:
# get the total number of pages in a PDF; technique 1.
# See this ans here: https://mcmap.net/q/347278/-how-to-write-shell-script-for-finding-number-of-pages-in-pdf
# Usage (works on ALL PDFs--whether password-protected or not!):
# num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"
    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"
    echo "$_num_pgs"
}
# get the total number of pages in a PDF; technique 2.
# See my ans here: https://mcmap.net/q/347278/-how-to-write-shell-script-for-finding-number-of-pages-in-pdf
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
# num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"
    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi
    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"
    echo "$_num_pgs"
}
Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg PDF, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same PDF. The pdfinfo technique in Ocaso's answer below is also very fast, the same as the pdftoppm one.
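One possible refinement of getNumPgsInPdf2: it expands $_password unquoted, so a password containing spaces would be split into several arguments. Below is a hedged sketch that passes the password as its own argument via the positional parameters and does the extraction in a single sed call instead of two greps. The name getNumPgsInPdf3 is made up here, and the function still relies on the exact wording of the pdftoppm error message shown earlier, which may differ between poppler versions.

```shell
#!/bin/sh
# Hypothetical variant of getNumPgsInPdf2 (the name getNumPgsInPdf3 is made up).
# Usage: num_pgs="$(getNumPgsInPdf3 "path/to/mypdf.pdf" "optional password")"
getNumPgsInPdf3() {
    _pdf="$1"
    _password="$2"
    # Build the option list in the function's positional parameters so that
    # each option stays a separate, intact argument.
    set -- -f 1000000
    if [ -n "$_password" ]; then
        set -- "$@" -upw "$_password"
    fi
    # Prints the number from "... the last page (N)." and nothing at all
    # if pdftoppm is missing or reports some other error.
    pdftoppm "$@" "$_pdf" 2>&1 | sed -n 's/.*last page (\([0-9]\{1,\}\))\.$/\1/p'
}
```

Since the function prints nothing on failure, callers should check for an empty result before using it as a number.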
See also
- These awesome answers by Ocaso Protal.
- These functions above will be used in my pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.
mupdf/mutool solution:
mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2
Just dug out an old script (in ksh) I found:
#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
# pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'
[[ "$#" != "1" ]] && {
    printf "ERROR: No file specified\n"
    exit 1
}
numpages=0
while read line; do
    num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
    (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages
If you're on macOS you can query pdf metadata like this:
mdls -name kMDItemNumberOfPages -raw file.pdf
as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal
Another mutool solution making better use of the options:
mutool show file.pdf Root/Pages/Count
I made a small improvement to Marius Hofert's tip to sum the returned values.
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'
QPDF offers the most straightforward method I'm aware of.
qpdf --show-npages input.pdf
To build on Marius Hofert's answer, this command uses a bash for loop to show the number of pages next to each filename, and it ignores the case of the file extension.
for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf "%s", $2}'; echo " $f"; done
[...] grep and awk in a pipeline is a bit of an overdo, it's better to use awk solely, which reduces the pipe count by one. Also, use shopt -s nocaseglob to ignore the file extension's case instead of entering every capital letter manually. – Besse
A super quick but effective alternative is the great exiftool program.
exiftool -FileName -PageCount -T file.pdf
For example, with file.pdf having 5 pages, the output will be:
file.pdf 5
Extra bonus:
Create a text file listing all PDF files in the current directory with their page counts:
exiftool -FileName -PageCount -T -ext pdf . > report.txt
You can also scan subfolders recursively with the -r flag:
exiftool -FileName -PageCount -T -r -ext pdf . > report.txt
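Pulling a couple of the answers together, here is a minimal sketch of a wrapper that tries qpdf --show-npages first and falls back to pdfinfo, so a script keeps working on machines where only one of them is installed. The function name pdf_pages is made up, and the choice and order of tools is just an assumption; swap in whichever of the tools above you actually have.

```shell
#!/bin/sh
# Hypothetical helper (not from any answer above): print the page count of a
# PDF using qpdf or pdfinfo, whichever is available; return 1 on failure.
pdf_pages() {
    _file="$1"
    _n=""
    if command -v qpdf >/dev/null 2>&1; then
        _n="$(qpdf --show-npages "$_file" 2>/dev/null)"
    fi
    if [ -z "$_n" ] && command -v pdfinfo >/dev/null 2>&1; then
        _n="$(pdfinfo "$_file" 2>/dev/null | awk '/^Pages:/ {print $2}')"
    fi
    # Accept only a plain non-negative integer; anything else is a failure.
    case "$_n" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$_n"
}
```

Usage would look like n="$(pdf_pages file.pdf)" || echo "could not count pages" >&2.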