Convert PDF page to image with PyPDF2 and BytesIO

Asked 11/3, 2017 at 9:27 Answered 12/9, 2024 at 19:34

I have a function that gets a page from a PDF file via PyPDF2 and should convert the first page to a PNG (or JPG) with Pillow (the PIL fork):

from PyPDF2 import PdfFileWriter, PdfFileReader
import os
from PIL import Image
import io

# Open PDF Source #
app_path = os.path.dirname(__file__)
src_pdf= PdfFileReader(open(os.path.join(app_path, "../../../uploads/%s" % filename), "rb"))

# Get the first page of the PDF #
dst_pdf = PdfFileWriter()
dst_pdf.addPage(src_pdf.getPage(0))

# Create BytesIO #
pdf_bytes = io.BytesIO()
dst_pdf.write(pdf_bytes)
pdf_bytes.seek(0)

file_name = "../../../uploads/%s_p%s.png" % (name, pagenum)
img = Image.open(pdf_bytes)
img.save(file_name, 'PNG')
pdf_bytes.flush()

That results in an error:

OSError: cannot identify image file <_io.BytesIO object at 0x0000023440F3A8E0>

I found some threads with a similar issue (PIL open() method not working with BytesIO), but I cannot see what I am doing wrong here, as I have already added pdf_bytes.seek(0).

Any hints appreciated.

Aldrin answered 11/3, 2017 at 9:27

Per the PyPDF2 documentation for PdfFileWriter.write:

write(stream) Writes the collection of pages added to this object out as a PDF file.

Parameters: stream – An object to write the file to. The object must support the write method and the tell method, similar to a file object.

So the object pdf_bytes contains a PDF file, not an image file.

The reason similar code sometimes works is that some PDF files contain nothing but an embedded JPEG. If your PDF is an ordinary PDF, you can't simply read its bytes and parse them as an image.

For a more robust implementation, see: https://mcmap.net/q/186734/-extract-images-from-pdf-without-resampling-in-python
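To make the point concrete, here is a minimal sketch (standard library only) that inspects the leading bytes of a stream. As far as I understand, Pillow also picks its decoder from the first bytes of the file, so a stream that starts with %PDF- gives it nothing to recognize. The sniff_format helper and the SIGNATURES table are illustrative names of my own, not part of PyPDF2 or Pillow, and the table is deliberately incomplete.

```python
import io

# Leading "magic" bytes of a few formats (a short, non-exhaustive table).
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_format(stream):
    """Guess the format of a binary stream from its first bytes."""
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)  # leave the stream position unchanged
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


# Stand-in for what PdfFileWriter.write() puts into the BytesIO object.
pdf_bytes = io.BytesIO(b"%PDF-1.7\n% rest of the document would follow\n")
print(sniff_format(pdf_bytes))  # prints "pdf"
```

Running a check like this on pdf_bytes before calling Image.open would show that the buffer holds a PDF document, which needs to be rendered rather than decoded.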

Gavingavini answered 11/3, 2017 at 10:48

I'm not sure whether the pdf2image library was available (or as mature) when the question was originally posted, but as of 2024 it offers a more elegant solution.

You can quickly turn a PDF into a list of images representing each page:

! pip install pdf2image

import pdf2image
images = pdf2image.convert_from_path('input.pdf')

In [1]: type(images[0])
Out [1]: PIL.PpmImagePlugin.PpmImageFile

If you want to save all of the pages to disk:


import os

os.makedirs("output_dir", exist_ok=True)  # make sure the target folder exists
for i, image in enumerate(images):
    image.save(os.path.join("output_dir", f"page{i}.png"))
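Since the question already holds the single page in a BytesIO object, a sketch along the following lines might fit more directly. It assumes pdf2image and its Poppler dependency are installed; first_page_to_png and page_png_name are hypothetical helper names, and the "_p1.png" naming scheme is only borrowed from the question for illustration.

```python
import os


def page_png_name(stem, page_number):
    """Build a name such as 'report_p1.png' (illustrative naming scheme)."""
    return f"{stem}_p{page_number}.png"


def first_page_to_png(pdf_bytes, out_dir, stem):
    """Render page 1 of an in-memory PDF and save it as a PNG.

    pdf_bytes is expected to be an io.BytesIO holding a complete PDF.
    Returns the path of the written file.
    """
    # Imported lazily so page_png_name stays usable without pdf2image.
    from pdf2image import convert_from_bytes

    # first_page and last_page are 1-based; together they restrict
    # rendering to a single page.
    images = convert_from_bytes(pdf_bytes.getvalue(), first_page=1, last_page=1)

    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, page_png_name(stem, 1))
    images[0].save(out_path, "PNG")
    return out_path
```

Because convert_from_bytes accepts the raw PDF bytes, the PyPDF2 step that copies the first page into a new writer becomes optional in this variant.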

Veator answered 12/9, 2024 at 19:34


import fitz  # PyMuPDF

# To get better resolution
zoom_x = 2.0  # horizontal zoom
zoom_y = 2.0  # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)  # zoom factor 2 in each dimension

filename = "/xyz/abcd/1234.pdf"  # name of the PDF file you want to render
doc = fitz.open(filename)
for i, page in enumerate(doc):
    pix = page.get_pixmap(matrix=mat)  # render page to an image
    pix.save(f"/xyz/abcd/1234_p{i}.png")  # one PNG per page, so pages don't overwrite each other
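If only the first page is needed, as in the question, a variant like the one below may be enough. It is a sketch under the assumption that PyMuPDF is installed; render_first_page and zoom_for_dpi are hypothetical names. The zoom factor relates to resolution through the PDF default of 72 units per inch, so a zoom of 2.0 corresponds to 144 dpi.

```python
def zoom_for_dpi(dpi):
    """Convert a target resolution to a zoom factor (PDF base is 72 per inch)."""
    return dpi / 72.0


def render_first_page(pdf_path, png_path, dpi=144):
    """Render only the first page of a PDF to a PNG file.

    Returns the (width, height) of the rendered image in pixels.
    """
    # Imported lazily so zoom_for_dpi stays usable without PyMuPDF.
    import fitz  # PyMuPDF

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(0)  # pages are 0-based
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(png_path)
    return pix.width, pix.height
```

Passing a resolution instead of a raw zoom factor makes it easier to reason about the output size, but the matrix-based call is the same one used above.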

Credit: Convert PDF to Image in Python Using PyMuPDF, https://towardsdatascience.com/convert-pdf-to-image-in-python-using-pymupdf-9cc8f602525b

Geisler answered 23/9, 2022 at 9:57
