Extract images from PDF in high resolution with Python

3

11

I have managed to extract images from several PDF pages with the code below, but the resolution is quite low. Is there a way to increase it?

import fitz    
pdffile = "C:\\Users\\me\\Desktop\\myfile.pdf"
doc = fitz.open(pdffile)
for page_index in range(doc.pageCount):
    page = doc.loadPage(page_index)  
    pix = page.getPixmap()
    output = "image_page_" + str(page_index) + ".jpg"
    pix.writePNG(output)

I have also tried the code here, changing "if pix.n < 5" to "if pix.n - pix.alpha < 4", but this didn't output any images in my case.

Sassy answered 10/9, 2020 at 0:20 Comment(0)
13

As stated in this PyMuPDF issue on GitHub, you have to use a matrix.

The example given is:

zoom = 2    # zoom factor
mat = fitz.Matrix(zoom, zoom)
pix = page.getPixmap(matrix = mat, <...>)

The issue also notes that the default resolution is 72 dpi if you don't pass a matrix, which likely explains the low resolution you are getting.
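A minimal sketch of the matrix approach using the newer snake_case names, assuming a PyMuPDF version that provides them. Since the default is 72 dpi, a zoom factor of 2 corresponds to about 144 dpi. The helper names zoom_for_dpi and render_pages_with_matrix and the output file pattern are made up for illustration and are not part of PyMuPDF.

```python
def zoom_for_dpi(dpi, base_dpi=72):
    """Return the zoom factor that scales the 72 dpi default up to `dpi`."""
    return dpi / base_dpi


def render_pages_with_matrix(pdf_path, dpi=144):
    """Render every page of `pdf_path` to a PNG at roughly `dpi` (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    mat = fitz.Matrix(zoom, zoom)  # same scale factor in x and y
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=mat)
            # Illustrative file name pattern, not taken from the question
            pix.save(f"page_{page.number}_{dpi}dpi.png")
```

Calling render_pages_with_matrix("myfile.pdf", dpi=288) would then use a zoom factor of 4.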

Erepsin answered 10/9, 2020 at 6:13 Comment(1)
Please note that camelCase names in PyMuPDF have been renamed to snake_case versions, and a few names are entirely new, e.g. pix.writePNG became pix.save. This happened in v1.18.4 and is mandatory since 1.20.0. - Mallissa
7

Even simpler than making a matrix, the documentation for getPixmap() shows that you can use the dpi argument for higher resolution:

pix = page.getPixmap(dpi=200)

This is new as of v1.19.2.
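For a full loop, here is a rough sketch with the snake_case method name, assuming PyMuPDF v1.19.2 or later. PDF page sizes are measured in points (1/72 inch), so the rendered size grows by a factor of dpi / 72. The helper names expected_pixel_size and render_pages_with_dpi are hypothetical, and the exact rounding PyMuPDF applies may differ by a pixel.

```python
def expected_pixel_size(width_pt, height_pt, dpi):
    """Approximate pixmap size for a page measured in points (1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def render_pages_with_dpi(pdf_path, dpi=300):
    """Render every page of `pdf_path` to a PNG at `dpi` (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so expected_pixel_size works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # dpi argument needs v1.19.2+
            pix.save(f"page_{page.number}.png")
```

For example, expected_pixel_size(612, 792, 300) gives (2550, 3300) for a US Letter page.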

Papain answered 18/10, 2022 at 19:24 Comment(1)
Thanks, I increased the DPI to 500 and the quality improved a lot! - Occasionally
1

To get the best quality, set a high 'dpi'. Passing a 'matrix' as well adds nothing: according to the PyMuPDF documentation, the matrix is ignored whenever dpi is given. I implemented a solution that converts every page of all PDF files in the current folder:

# pip install "PyMuPDF>=1.19.2"
# (the dpi argument needs v1.19.2 or later; do not install the unrelated "fitz" package)

import fitz  # PyMuPDF
import glob

for filename in glob.glob("*.pdf"):
    doc = fitz.open(filename)
    for page_index in range(doc.page_count):
        try:
            page = doc.load_page(page_index)
            # dpi alone sets the resolution; 1200 produces very large images
            pix = page.get_pixmap(dpi=1200)
            output = "_" + filename.replace(".pdf", "") + "-" + str(page_index) + ".png"
            pix.save(output)
        except Exception as e:
            # Report the failing file and continue with the next page
            print(str(filename) + " > " + str(e))
    doc.close()

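
Before settling on a very high value such as the dpi=1200 used above, it may help to estimate how large each rendered page becomes. The sketch below assumes an uncompressed pixmap takes width x height x channels bytes, with 3 channels for RGB without alpha. The function name estimated_pixmap_bytes is hypothetical, and real memory use may be higher.

```python
def estimated_pixmap_bytes(width_pt, height_pt, dpi, channels=3):
    """Rough uncompressed size in bytes of a page rendered at `dpi`.

    Page dimensions are given in points (1/72 inch).
    """
    scale = dpi / 72
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels


# US Letter page (612 x 792 points) at 1200 dpi
print(estimated_pixmap_bytes(612, 792, 1200))
```

Under these assumptions a single US Letter page at 1200 dpi takes roughly 400 MB before PNG compression, so a lower dpi is often a reasonable trade-off.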
Aestival answered 22/6, 2023 at 23:5 Comment(0)
