Converting pdf to png with python (without pdf2image)

Asked 20/10, 2021 at 10:1 Answered 21/6, 2023 at 7:56

I want to convert a one-page PDF into a PNG file. I installed pdf2image and get this error: poppler is not installed in windows.

According to this question: Poppler in path for pdf2image, poppler should be installed and PATH modified.

I cannot do either of those (I don't have the necessary permissions on the system I am working with).

I looked at OpenCV and PIL, and neither seems able to do this conversion: PIL (see https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html?highlight=pdf#pdf) cannot read PDFs, it can only save images as PDFs. OpenCV cannot read PDFs either.

Any suggestions on how to do the PDF-to-PNG conversion? I can install any Python library, but I cannot touch the Windows installation.

thanks

Swordtail answered 20/10, 2021 at 10:1 Comment(2)

I HAVE to do it in python because I can only connect to the APIs from a Jupyter Hub environment, and it has to be done on the fly. – Swordtail 20/10, 2021 at 15:5

Lucky you, thank the admins for protecting your code from infection by poppler's "viral" copyleft (GPL) license – Aircondition 20/5, 2023 at 11:37

PyMuPDF supports pdf to image rasterization without requiring any external dependencies.

Sample code to do a basic pdf to png transformation:

import fitz  # PyMuPDF, imported as fitz for backward compatibility reasons
file_path = "my_file.pdf"
doc = fitz.open(file_path)  # open document
for i, page in enumerate(doc):
    pix = page.get_pixmap()  # render page to an image
    pix.save(f"page_{i}.png")

Interesting answered 20/10, 2021 at 10:23 Comment(6)

Hi @Interesting, you are opening a my_file.png; I assume that should be a PDF, right? – Swordtail 20/10, 2021 at 15:16

That was indeed a typo, fixed it! – Interesting 20/10, 2021 at 16:9

How can you convert just the first 10 pages? – Marginalia 7/12, 2021 at 7:24

doc is indexable, so you can just use a for loop: for i in range(10), and set page=doc[i]. – Interesting 7/12, 2021 at 17:36
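A minimal sketch of that suggestion, written as a helper that takes an already-opened document. The function name `render_first_pages` and its `limit` and `prefix` parameters are made up for illustration; the only PyMuPDF features it relies on are the ones used in the answer above (`doc[i]`, `page.get_pixmap()`, `pix.save()`), plus `len(doc)` for the page count.

```python
def render_first_pages(doc, limit=10, prefix="page"):
    """Render at most `limit` leading pages of an open document to PNG files.

    `doc` is assumed to be a document returned by fitz.open(). The helper
    (a hypothetical name, not part of PyMuPDF) returns the file names written.
    """
    written = []
    # min() guards against documents that have fewer than `limit` pages
    for i in range(min(limit, len(doc))):
        pix = doc[i].get_pixmap()  # render page i to an image
        name = f"{prefix}_{i}.png"
        pix.save(name)
        written.append(name)
    return written
```

It would be called as render_first_pages(fitz.open("my_file.pdf"), limit=10). Capping the range at the page count avoids an index error on short documents.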

Thanks for your competent comments, @Interesting - just an addition: the new PyMuPDF version 1.22.0 also supports saving to JPEG directly, without having to use Pillow: pix.save("file.jpg", jpg_quality=n). As can be seen, the JPEG quality can be chosen with an additional parameter. – Chukker 17/4, 2023 at 11:3

Note it is licensed under AGPL, which still requires you to disclose source, like GPL-licensed poppler called by pdf2image (and network use is deemed to be distribution). – Aircondition 20/5, 2023 at 11:59

Here is a snippet that generates PNG images of arbitrary resolution (dpi):

import fitz
file_path = "my_file.pdf"
dpi = 300  # choose desired dpi here
zoom = dpi / 72  # zoom factor, standard: 72 dpi
magnify = fitz.Matrix(zoom, zoom)  # scales by the same factor in x and y direction
doc = fitz.open(file_path)  # open document
for page in doc:
    pix = page.get_pixmap(matrix=magnify)  # render page to an image
    pix.save(f"page-{page.number}.png")

This generates PNG files named page-0.png, page-1.png, ... Choosing dpi < 72 produces thumbnail-sized page images.
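To get a feel for the numbers, here is the zoom arithmetic on its own, without PyMuPDF. The helper names are made up for illustration, and the pixel sizes are approximations: the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points, 1 point = 1/72 inch


def zoom_for_dpi(dpi):
    """Zoom factor to pass to the matrix for a target resolution."""
    return dpi / POINTS_PER_INCH


def approx_pixel_size(width_pt, height_pt, dpi):
    """Approximate rendered size in pixels for a page given in points."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


# An A4 page is roughly 595 x 842 points
print(approx_pixel_size(595, 842, 300))  # (2479, 3508)
print(approx_pixel_size(595, 842, 24))   # (198, 281), a thumbnail
```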

Chukker answered 20/10, 2021 at 22:18 Comment(3)

second row should be fname =, not file_path = – Sally 2/12, 2022 at 2:58

From their rtd (pymupdf.readthedocs.io/en/latest/recipes-images.html): "Since version 1.19.2 there is a more direct way to set the resolution: Parameter "dpi" (dots per inch) can be used in place of "matrix". To create a 300 dpi image of a page specify pix = page.get_pixmap(dpi=300). Apart from notation brevity, this approach has the additional advantage that the dpi value is saved with the image file – which does not happen automatically when using the Matrix notation." – Glorification 15/4, 2023 at 23:34
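For the "on the fly" Jupyter case from the question, the dpi keyword can be combined with in-memory output so that no files are written. This is a sketch under assumptions: it needs PyMuPDF 1.19.2 or newer for the dpi keyword (per the docs quoted above), it uses `pix.tobytes("png")` to get the encoded image, and the helper name `pages_as_png_bytes` is made up for illustration.

```python
def pages_as_png_bytes(doc, dpi=300):
    """Return one PNG-encoded bytes object per page, without writing files.

    `doc` is assumed to be a document returned by fitz.open(); the helper
    name is hypothetical and not part of PyMuPDF.
    """
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)     # dpi keyword instead of a Matrix
        images.append(pix.tobytes("png"))  # encode the rendered page as PNG in memory
    return images
```

It would be called as images = pages_as_png_bytes(fitz.open("my_file.pdf")). In a notebook, the first page could then be shown with IPython.display.Image(data=images[0]), or the bytes could be passed on to whatever API needs them.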

Note the fitz Github repo has been archived by the owner on Aug 3, 2022. It is now read-only. The only version on PyPI is a 5-year-old version tagged "pre-release":) – Aircondition 20/5, 2023 at 12:2

import fitz  # PyMuPDF

input_pdf = r"Samples\104295.pdf"
output_jpg = r"Output\104295.jpg"  # the Output folder must already exist

# Render the first page of the PDF and save it as a JPEG
def split_and_convert(pdf_path, output_path):
    doc = fitz.open(pdf_path)
    page = doc.load_page(0)  # pages are zero-indexed
    pix = page.get_pixmap()  # render the page to an image
    pix.save(output_path, "jpeg")  # second argument is the output format
    doc.close()

split_and_convert(input_pdf, output_jpg)

Recidivism answered 21/6, 2023 at 7:56 Comment(1)

Please add details explaining what your answer does and how it solves the problem, in addition to your code. – Finished 21/6, 2023 at 22:38
