Get the number of pages in a PDF document
A

17

80

This question is for referencing and comparing. The solution is the accepted answer below.

Many hours have I searched for a fast and easy, but mostly accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be precisely known before they are processed. PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick takes a lot of work to install and needs an Apache restart. When I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression:

This opens the PDF file in a stream and searches for some kind of string, containing the pagecount or something similar.

$f = "test1.pdf";
$stream = fopen($f, "r");
$content = fread ($stream, filesize($f));

if(!$stream || !$content)
    return 0;

$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";

if(preg_match_all($regex, $content, $matches))
    $count = (int) max($matches[1]);  // $matches[1] holds the captured numbers

return $count;
  • /\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the parameter /Count inside, so most of the time it doesn't return anything. Source.
  • /\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages, mostly contains some other data. Source.
  • /\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as the documents can contain multiple values of /N ; most, if not all, not containing the pagecount. Source.
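One likely reason such raw-byte searches miss is that the page tree can sit inside a compressed stream, where the text /Count never appears literally. The following is a minimal Python sketch of that effect, not a PDF parser: the byte strings are synthetic stand-ins invented for illustration, and max_count is a hypothetical helper name.

```python
import re
import zlib

COUNT_RE = re.compile(rb"/Count\s+(\d+)")

def max_count(pdf_bytes: bytes) -> int:
    """Return the largest /Count value visible in the raw bytes, or 0 if none."""
    values = [int(m) for m in COUNT_RE.findall(pdf_bytes)]
    return max(values, default=0)

# Synthetic page-tree object stored as plain text: the scan finds it.
plain = (b"1 0 obj\n<< /Type /Pages /Kids [2 0 R 3 0 R 4 0 R 5 0 R 6 0 R 7 0 R] "
         b"/Count 6 >>\nendobj\n")

# The same object deflated, roughly as it would look inside a compressed stream.
compressed = (b"8 0 obj\n<< /Type /ObjStm >>\nstream\n"
              + zlib.compress(plain) + b"\nendstream\nendobj\n")

print(max_count(plain))       # 6
print(max_count(compressed))  # 0
```

A tool that actually decodes the file structure does not have this blind spot, which is the motivation for the accepted answer.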

So, what does work reliably and accurately?

See the answer below

Anglian answered 1/2, 2013 at 10:33 Comment(0)
A
113

A simple command line executable called: pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

I haven't seen a PDF document where it returned a false pagecount (yet). It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.

There is an easy way of extracting the pagecount from the output, here in PHP:

// Make a function for convenience 
function getPDFPages($document)
{
    $cmd = "/path/to/pdfinfo";              // Linux
    // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows: use this line instead
    
    // Parse entire output
    // escapeshellarg() quotes the file name, so spaces and shell metacharacters are safe
    exec($cmd . " " . escapeshellarg($document), $output);

    // Iterate through lines
    $pagecount = 0;
    foreach($output as $op)
    {
        // Extract the number
        if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }
    
    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13

Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.
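As an illustration of using it from another language, here is a rough Python equivalent of the PHP function. It is only a sketch: it assumes pdfinfo is on the PATH, and the names parse_page_count and get_pdf_pages are made up for this example.

```python
import re
import subprocess

# Matches the "Pages:" line of pdfinfo output; "Page size:" does not match.
PAGES_RE = re.compile(r"^Pages:\s*(\d+)", re.MULTILINE)

def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from pdfinfo's text output, or 0 if absent."""
    match = PAGES_RE.search(pdfinfo_output)
    return int(match.group(1)) if match else 0

def get_pdf_pages(document: str, pdfinfo: str = "pdfinfo") -> int:
    """Run pdfinfo on a file and return its page count."""
    # An argument list (no shell) means spaces in the file name need no quoting.
    result = subprocess.run(
        [pdfinfo, document],
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # metadata fields are not always valid UTF-8
        check=True,        # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Usage would look like get_pdf_pages("test 1.pdf"). Keeping the parsing separate from the process call makes the regular expression easy to check against sample output.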

I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.

Security Notice: Use escapeshellarg on $document if document name is being fed from user input or file uploads.
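The same concern applies in any language that builds a shell command string. As a hedged Python sketch (build_pdfinfo_command is a hypothetical helper, and shlex.quote targets POSIX shells only), quoting turns a hostile file name into a single harmless argument:

```python
import shlex

def build_pdfinfo_command(document: str, pdfinfo: str = "pdfinfo") -> str:
    """Build a POSIX shell command line with every argument quoted."""
    return f"{shlex.quote(pdfinfo)} {shlex.quote(document)}"

# A file name that would run a second command if it were left unquoted.
print(build_pdfinfo_command("test 1.pdf; rm -rf ~"))
# pdfinfo 'test 1.pdf; rm -rf ~'
```

Where possible, passing an argument list to the process API and skipping the shell entirely avoids the problem altogether.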

Anglian answered 1/2, 2013 at 10:33 Comment(15)
+1 for taking the time to help the community and for sharing the knowledge you gained from this problem. - Jadeite
As an alternative (if pdfinfo is not available on the server), you can also use pdftk with the dump_data option. You only need two changes: set the $cmd variable to the pdftk binary, and change the preg_match pattern from Pages to NumberOfPages. That's all :-) - Gleiwitz
@bouchon - It sure looks like something nice (the Server one that is, the rest has a GUI), although you have to install it. pdfinfo is a single binary file. Just download it and place it anywhere (e.g. next to your PHP script for easy access). - Anglian
I made a composer package for this. I hope it helps: github.com/howtomakeaturn/pdfinfo - Wallin
@尤川豪 Wow, that's really impressive! I'm honored :) - Anglian
This can be done right in the shell using the usual GNU tools: pdfinfo $PDF_File | grep Pages | awk '{print $2}' - Selfrenunciation
Any recommendation for CentOS/Amazon Linux? pdfinfo and xpdf don't seem to be available for this OS. - Osana
I found poppler. sudo yum install poppler-utils and now I have pdfinfo in Amazon Linux on EC2. - Osana
mutool info from the mupdf-tools package is significantly faster than pdfinfo. - Clausen
For lazy Ubuntu / lxss users: pdfinfo comes with sudo apt-get install poppler-utils - Hollar
What if I need to get the number of pages of an online PDF without downloading it? - Biome
@f126ck Well, you need some way of reading the file. So I guess you could load it with curl or wget into a temp file and then run the script on it. - Anglian
pdfinfo is definitely worth it for counting pages, compared with command-line gs or npm packages like pdf2json. Its output is immediate, whereas the others take several seconds for large files. Thank you! - Telephoto
I can only recommend qpdf, because qpdf returns JSON or pure values, so you save yourself the parsing. This is not only faster but also less code. - Teeny
Be aware that pdfinfo (and all of poppler-utils) is licensed under GPL. glyphandcog.com/opensource.html - Legator
E
33

Simplest of all is using ImageMagick

Here is some sample code:

$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();

Otherwise, you can also use PDF libraries like MPDF or TCPDF for PHP.

Equation answered 30/12, 2015 at 15:29 Comment(3)
Brilliant, thank you. Just something to note, though: not all PHP installations have the imagick extension installed, so you may need to check that the class exists first. - Printer
I found this worked at first, but then some PDFs gave a 'Failed to read the file' error; presumably they were not compatible. I suggest using the library noted above: github.com/howtomakeaturn/pdfinfo - Germanous
I'm getting an error like this: Imagick::pingImage not implemented - Geophysics
O
11

You can use qpdf like below. If a file file_name.pdf has 100 pages,

$ qpdf --show-npages file_name.pdf
100
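Because qpdf prints only the number, wrapping it needs no output parsing. A minimal Python sketch, assuming qpdf is on the PATH (qpdf_page_count is a hypothetical helper name):

```python
import subprocess

def qpdf_page_count(path: str, qpdf: str = "qpdf") -> int:
    """Return the page count reported by `qpdf --show-npages`."""
    result = subprocess.run(
        [qpdf, "--show-npages", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if qpdf reports an error
    )
    # The output is the bare number followed by a newline.
    return int(result.stdout.strip())
```

For the file above, qpdf_page_count("file_name.pdf") would return 100.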
Oira answered 19/8, 2019 at 19:26 Comment(1)
+1 One of the few options that is not licensed under GPL: qpdf.sourceforge.net - Legator
M
6

Here is a simple example to get the number of pages in PDF with PHP.

<?php

function count_pdf_pages($pdfname) {
  $pdftext = file_get_contents($pdfname);
  $num = preg_match_all("/\/Page\W/", $pdftext, $dummy);

  return $num;
}

$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);

echo $pages;

?>
Manassas answered 27/10, 2020 at 13:38 Comment(3)
For PDFs without incremental updates this may often work. - Chucho
Can confirm this works on many occasions. But recently I ran into problems with PDFs of more than 150 pages. E.g. for a 179-page PDF, this counts 181. Other than that, simple and useful. - Pancreatin
The reason for the extra pages is likely pdfmarks for bookmarks. See the bookmarks section of the Adobe pdfmarks reference. - Siward
C
3

If you can't install any additional packages, you can use this simple one-liner:

foundPages=$(strings < $PDF_FILE | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
Chemical answered 25/9, 2014 at 5:10 Comment(3)
Could you explain what it does and needs? What is sed -n, sort -rn and head -n? Also it seems that you are looking for /Count <number>, which I showed in my question, doesn't work. - Anglian
strings - grabs all strings from PDF binary. Sed - matches values found in 'Count' strings. The (-n) when used in conjunction with (p)rint will avoid repetition of line printing. sort - will take found 'Count' values and sort in (-r)everse order, handling each as a (n)umbers (descending). head - will print first -n line numbers. In this case, 1 (default is 10), which will be the highest 'Count' value. I haven't run across any PDFs that haven't had a Count value. Just luck I guess. Have you verified that your regex is working properly outside of preg_match_all? - Peanut
Thank you for your explanation. Yes I have. I have tested a lot of PDFs for this (as I'm mainly working with PDFs, like 100/day) and approx. 40% of all PDFs actually have the Count value. I've also tested this by simply writing the stream to a textfile and search for it (or even parts of it) manually. On some PDFs I've found it, but on most PDFs I didn't. - Anglian
D
3

I had problems installing ImageMagick on a production server. After hours of attempts, I decided to get rid of IM and found another approach:

Install poppler-utils:

$ sudo apt install poppler-utils     [On Debian/Ubuntu & Mint]
$ sudo dnf install poppler-utils     [On RHEL/CentOS & Fedora]
$ sudo zypper install poppler-tools  [On OpenSUSE]  
$ sudo pacman -S poppler             [On Arch Linux]

Then execute it via the shell from your programming language (e.g. PHP):

shell_exec("pdfinfo " . escapeshellarg($filePath) . " | grep Pages | cut -f 2 -d':' | xargs");
Dickenson answered 9/12, 2022 at 13:3 Comment(0)
D
2

This seems to work pretty well, without the need for special packages or parsing command output.

<?php                                                                               

$target_pdf = "multi-page-test.pdf";                                                
$cmd = sprintf("identify %s", $target_pdf);                                         
exec($cmd, $output);                                                                
$pages = count($output);
Dugong answered 1/6, 2017 at 21:40 Comment(1)
Running this command returns the following: identify-im6.q16: attempt to perform an operation not allowed by the security policy PDF @ error/constitute.c/IsCoderAuthorized/408. - Jurat
A
2

Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:

cpdf.exe -pages "my file.pdf"
Athletics answered 19/5, 2019 at 2:6 Comment(0)
T
2

I created a wrapper class for pdfinfo in case it's useful to anyone, based on Richard's answer:

/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 * 
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {

    const PDFINFO_CMD = 'pdfinfo';

    /**
     * keyed array to hold all the info
     */
    protected $info = array();

    /**
     * raw output in case we need it
     */
    public $raw = "";

    /**
     * Constructor
     * @param string $filePath - path to file
     */
    public function __construct($filePath) {
        exec(self::PDFINFO_CMD . ' "' . $filePath . '"', $output);

        //loop each line and split into key and value
        foreach($output as $line) {
            $colon = strpos($line, ':');
            if($colon) {
                $key = trim(substr($line, 0, $colon));
                $val = trim(substr($line, $colon + 1));

                //use strtolower to make case insensitive
                $this->info[strtolower($key)] = $val;
            }
        }

        //store the raw output
        $this->raw = implode("\n", $output);

    }

    /**
     * get a value
     * @param string $key - key name, case insensitive
     * @returns string value
     */
    public function getValue($key) {
        return @$this->info[strtolower($key)];
    }

    /**
     * list all the keys
     * @returns array of key names
     */
    public function getAllKeys() {
        return array_keys($this->info);
    }

}
Tabling answered 6/2, 2020 at 9:30 Comment(2)
Thinking about this, for security a check that $filePath is valid (i.e. if(!file_exists($filePath)) return false) should probably be added prior to calling exec(). - Tabling
james, well said. I would use is_readable instead of file_exists, though :) Thanks - Scutter
A
1

This simple one-liner seems to do the job well:

strings $path_to_pdf | grep Kids | grep -o R | wc -l

There is a block in the PDF file which details the number of pages in this funky string:

/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]

The number of 'R' characters is the number of pages.
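The counting idea can be sketched in Python as below. This is an illustration on synthetic bytes, not a general solution: the sample strings are invented, count_kids_refs is a hypothetical name, and the approach assumes the page tree is stored uncompressed. The second sample shows a caveat: when the page tree is nested, intermediate nodes are counted as well.

```python
import re

KIDS_RE = re.compile(rb"/Kids\s*\[(.*?)\]", re.DOTALL)
REF_RE = re.compile(rb"\d+\s+\d+\s+R")  # an indirect reference such as "3 0 R"

def count_kids_refs(pdf_bytes: bytes) -> int:
    """Count indirect references inside every /Kids array in the raw bytes."""
    return sum(len(REF_RE.findall(kids)) for kids in KIDS_RE.findall(pdf_bytes))

# Flat page tree: one /Kids array listing three page objects.
flat = b"<< /Type /Pages /Kids [3 0 R 4 0 R 5 0 R] /Count 3 >>"

# Nested page tree: the root lists two intermediate nodes, which list three pages.
nested = (b"<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 3 >>\n"
          b"<< /Type /Pages /Kids [4 0 R 5 0 R] /Count 2 >>\n"
          b"<< /Type /Pages /Kids [6 0 R] /Count 1 >>")

print(count_kids_refs(flat))    # 3
print(count_kids_refs(nested))  # 5, although the document has only 3 pages
```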

screenshot of terminal showing output from strings

Adulteration answered 22/8, 2021 at 21:45 Comment(0)
J
1

You can use mutool.

mutool show FILE.pdf trailer/Root/Pages/Count

mutool is part of the MuPDF software package.

Jillion answered 11/10, 2021 at 8:17 Comment(0)
S
0

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

pdf.file.page.number <- function(fname) {
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (F) {
    pdf.file.page.number("a.pdf")
}
Stampede answered 13/8, 2015 at 19:41 Comment(0)
C
0

Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file.

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem

:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=% 
  set __datetime__=%__datetime__: =% 
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3

:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A 
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%

:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end

:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end

:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end

:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end

:end
  exit /b
Coccyx answered 3/11, 2015 at 0:17 Comment(0)
L
0

The pdf_info() function in the R package pdftools provides information on the number of pages in a PDF.

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages

$pages
[1] 65
Lettielettish answered 18/1, 2017 at 22:3 Comment(0)
G
0

If you have shell access, the simplest approach (though it does not work on 100% of PDFs) is to use grep.

This should return just the number of pages:

grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

Example: https://regex101.com/r/BrUTKn/1

Switches description:

  • -m 1 is necessary as some files can have more than one match of the regex pattern (volunteer needed to replace this with a regex that matches only the first occurrence)
  • -a is necessary to treat the binary file as text
  • -o to show only the match
  • -P to use Perl regular expression

Regex explanation:

  • starting "delimiter": (?<=\/N ) lookbehind of /N (nb. space character not seen here)
  • actual result: \d+ any number of digits
  • ending "delimiter": (?=\/) lookahead of /

Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
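For reference, the same pattern can be tried in Python on raw bytes. This is a sketch on synthetic data: the sample dictionary below is invented, and linearized_page_count is a hypothetical name. The /N entry is the page-count field of a linearization dictionary, so the search only succeeds on linearized ("fast web view") files; the sketch returns None otherwise and leaves the fallback to the caller.

```python
import re

# Same pattern as the grep call: digits preceded by "/N " and followed by "/".
N_RE = re.compile(rb"(?<=/N )\d+(?=/)")

def linearized_page_count(pdf_bytes: bytes):
    """Return the first /N value found, or None when there is no match."""
    match = N_RE.search(pdf_bytes)
    return int(match.group()) if match else None

# Synthetic linearization dictionary, as found near the start of a linearized file.
linearized = b"<</Linearized 1/L 7945/O 9/E 3524/N 13/T 7656/H [ 451 137]>>"
not_linearized = b"<< /Type /Catalog /Pages 1 0 R >>"

print(linearized_page_count(linearized))      # 13
print(linearized_page_count(not_linearized))  # None
```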

Glengarry answered 21/6, 2017 at 15:57 Comment(0)
R
0

This works fine in Imagemagick.

convert image.pdf -format "%n\n" info: | head -n 1

Roselleroselyn answered 9/12, 2022 at 16:48 Comment(0)
R
-1

You often see the regex /\/Page\W/ suggested, but it doesn't work for me on several PDF files. So here is another regular expression that works for me.

$pdf = file_get_contents($path_pdf);
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);
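
A related way to tighten the match is to require the /Type key and to exclude /Pages nodes explicitly. The Python sketch below shows that variant on synthetic bytes; it is not the regex from this answer, count_page_objects is a hypothetical name, and like every raw-byte scan it only sees objects that are stored uncompressed.

```python
import re

# "/Type /Page" not followed by a letter, so "/Type /Pages" is excluded.
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")

def count_page_objects(pdf_bytes: bytes) -> int:
    """Count page objects that are visible in the raw bytes."""
    return len(PAGE_RE.findall(pdf_bytes))

sample = (b"<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >>\n"
          b"<< /Type /Page /Parent 1 0 R >>\n"
          b"<</Type/Page/Parent 1 0 R>>")

print(count_page_objects(sample))  # 2
```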
Reasonless answered 31/12, 2021 at 9:9 Comment(0)
