How to get PDF file metadata 'Page Size' using Python?
J

3

5

I am trying to use the PyPDF2 module in Python 3, but I can't find a way to display the 'Page Size' property. I would like to know what the dimensions of the sheet of paper were before it was scanned to a PDF file.

Something like this:

import PyPDF2
pdf = PyPDF2.PdfFileReader("sample.pdf")  # legacy PyPDF2 (< 3.0) API
print(pdf.getNumPages())

But I'm looking for a different function, one that returns the page size rather than, for example, the page count from getNumPages().

The command below prints some metadata, but not the page size:

pdf_info=pdf.getDocumentInfo()
print(pdf_info)
Jellicoe answered 15/9, 2017 at 6:22 Comment(0)
I
8

This code should help you:

import PyPDF2
pdf = PyPDF2.PdfFileReader("a.pdf")  # legacy PyPDF2 (< 3.0) API
p = pdf.getPage(0)  # pages are zero-indexed; 0 is the first page

w_in_user_space_units = p.mediaBox.getWidth()
h_in_user_space_units = p.mediaBox.getHeight()

# 1 user space unit is 1/72 inch (the PDF default)
# 1/72 inch ~ 0.352 millimeters (exactly 25.4/72)
mm_per_unit = 25.4 / 72

w = float(w_in_user_space_units) * mm_per_unit
h = float(h_in_user_space_units) * mm_per_unit
Igniter answered 16/9, 2017 at 9:34 Comment(1)
the ~0.352 is exactly 25.4/72 (Beerbohm)
N
2

Here's a more up-to-date flavor using pypdf:

from pypdf import PdfReader

pdf = PdfReader("a.pdf")
page = pdf.pages[0]  # zero-indexed: this is the first page

cm_per_inch = 2.54
points_per_inch = 72  # 1 user space unit (point) is 1/72 inch

width_in_user_space_units = page.mediabox.width
height_in_user_space_units = page.mediabox.height

width_in_cm = float(width_in_user_space_units) / points_per_inch * cm_per_inch
height_in_cm = float(height_in_user_space_units) / points_per_inch * cm_per_inch
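
Scanned documents often mix page sizes and orientations, so it may help to loop over every page rather than inspect a single one. The following is only a sketch: it assumes a recent pypdf (3.x or later) where `page.mediabox` and `page.rotation` are available, and the helper names `rotated_size_mm` and `page_sizes_mm` are made up for this example.

```python
MM_PER_POINT = 25.4 / 72  # 1 PDF user space unit (point) is 1/72 inch


def rotated_size_mm(width_pt, height_pt, rotation=0):
    """Convert a media box from points to millimetres.

    Width and height are swapped when the page rotation is 90 or 270
    degrees, so the result follows the rotated orientation.
    """
    if rotation % 180 == 90:
        width_pt, height_pt = height_pt, width_pt
    return (round(float(width_pt) * MM_PER_POINT, 1),
            round(float(height_pt) * MM_PER_POINT, 1))


def page_sizes_mm(path):
    """Return a list of (width_mm, height_mm), one entry per page."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(path)
    return [
        rotated_size_mm(page.mediabox.width, page.mediabox.height, page.rotation)
        for page in reader.pages
    ]
```

Calling `page_sizes_mm("a.pdf")` would then give one tuple per page, which makes it easy to spot pages whose stored size differs from the rest.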

Nichol answered 21/3, 2023 at 12:55 Comment(0)
K
0

Getting the original "sheet of paper dimensions" from before scanning is not really possible, since a scanner is set to an output media size without knowing the size of the media being scanned.

Take, for example:

  • A Letter sheet of paper placed on an A4 scanner bed, or vice versa. The trace of the paper edge may or may not be visible in the output. The scanner simply works blind to the "source media", and a document with mixed rotations may need post-processing to rescale some pages or rotate them upright.

  • Another example is using a mobile phone to scan a docket. The source can be any size, but the scanning software determines the stored media size and rotation when the page is saved to file: A5, A4, A3, whatever, portrait or landscape.

Thus all you can ask of a PDF is the stored page size and display resolution, which often vary between pages, and that tells you nothing certain about the source rotation.

For a simple list of stored page sizes, there are several command-line utilities that can list the page variations.

Shell out to a one-line command tool such as xpdf or poppler pdfinfo, which can handle all the different types of PDF, and then parse its output. The output is similar for both and has many lines, but the one you need is:

xpdf\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4) (rotated 0 degrees)
and
poppler\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4)
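
To stay with the question's Python, the "shell out and parse" idea could be sketched roughly as below. This assumes a `pdfinfo` executable is on the PATH; the function names `run_pdfinfo` and `parse_page_sizes` are invented for this example, and the regular expression is written only against the "Page size" line formats quoted in this answer.

```python
import re
import subprocess

# Matches both "Page size: 594.96 x 841.92 pts (A4)" and
# "Page    2 size: 595 x 842 pts (A4) (rotated 0 degrees)".
PAGE_SIZE_RE = re.compile(
    r"^Page\s+(?:(\d+)\s+)?size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts",
    re.MULTILINE,
)


def parse_page_sizes(pdfinfo_output):
    """Return (page_number_or_None, width_pt, height_pt) for each size line."""
    return [
        (int(num) if num else None, float(width), float(height))
        for num, width, height in PAGE_SIZE_RE.findall(pdfinfo_output)
    ]


def run_pdfinfo(path):
    """Run `pdfinfo -box` on a file and return its text output."""
    result = subprocess.run(
        ["pdfinfo", "-box", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`parse_page_sizes(run_pdfinfo("filename"))` would then yield the sizes in points; the page number is `None` for the single summary line and an integer for per-page lines.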

When scanning, it is common to get size variation across the pages:

Page    2 size: 595 x 842 pts (A4) (rotated 0 degrees)
Page    3 size: 595.32 x 841.92 pts (A4) (rotated 0 degrees)
Page    4 size: 595.44 x 842.04 pts (A4) (rotated 0 degrees)
Page    5 size: 595.44 x 842.04 pts (A4) (rotated 0 degrees)
Page    6 size: 595.2 x 841.9 pts (A4) (rotated 0 degrees)
Page    7 size: 595.45 x 841.9 pts (A4) (rotated 0 degrees)
Page    8 size: 595.45 x 841.9 pts (A4) (rotated 0 degrees)
Page    9 size: 595.2 x 841.44 pts (rotated 0 degrees)
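
Because the stored sizes drift by fractions of a point, an exact comparison against a nominal paper size will usually fail. One possible approach is to match with a tolerance, as sketched below. The size table, the 1.5 pt tolerance and the name `guess_paper_name` are all assumptions made for illustration; this is not how pdfinfo decides on its "(A4)" label.

```python
# Nominal sizes in points (1/72 inch), short edge first,
# computed from the usual millimetre/inch dimensions.
STANDARD_SIZES_PT = {
    "A3": (841.89, 1190.55),
    "A4": (595.28, 841.89),
    "A5": (419.53, 595.28),
    "Letter": (612.0, 792.0),
    "Legal": (612.0, 1008.0),
}


def guess_paper_name(width_pt, height_pt, tolerance_pt=1.5):
    """Return (name, orientation) for the first standard size whose edges
    are both within tolerance_pt of the given page, or None if none match."""
    short, long_ = sorted((float(width_pt), float(height_pt)))
    orientation = "portrait" if width_pt <= height_pt else "landscape"
    for name, (std_short, std_long) in STANDARD_SIZES_PT.items():
        if (abs(short - std_short) <= tolerance_pt
                and abs(long_ - std_long) <= tolerance_pt):
            return name, orientation
    return None
```

With this tolerance, page 9 above (595.2 x 841.44 pts) would be reported as A4 portrait even though it carries no label in the listing. Whether that is the right call depends on how much scanner drift you are willing to accept.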
Kilkenny answered 8/7, 2023 at 10:41 Comment(0)
