Extract a page from a pdf as a jpeg
S

18

192

In Python, how can I efficiently save a specific page of a PDF as a JPEG file?

Use case: I have a Python flask web server where PDFs will be uploaded and JPEGs corresponding to each page are stored.

This solution is close, but the problem is that it does not convert the entire page to JPEG.

Shanney answered 12/9, 2017 at 19:44 Comment(2)
Depending on the image, it may be better to extract as a png. This would apply if the page contains mainly text.Rowland
Although generally true, the code using fitz that outputs PNG is substantially lower quality than the accepted one using JPG. I suspect the image resolutions are resized per PDF paper size.Rifle
S
231

The pdf2image library can be used.

You can install it with:

pip install pdf2image

Once installed, you can use the following code to get the images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/, but 0.68 was released in Aug 2018, so you won't be getting the latest features or bug fixes.
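For the use case in the question (one specific page rather than the whole document), a minimal sketch using the first_page and last_page arguments of convert_from_path could look like this. The helper names page_image_name and save_page_as_jpeg and the default of 200 dpi are assumptions made for illustration; they are not part of pdf2image.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number, ext="jpg"):
    """Hypothetical naming helper: ('docs/report.pdf', 3) -> 'report-page3.jpg'."""
    return f"{Path(pdf_path).stem}-page{page_number}.{ext}"


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render one 1-based page of pdf_path to a JPEG file and return its path."""
    # Imported here so the naming helper can be used without pdf2image/poppler.
    from pdf2image import convert_from_path

    # first_page and last_page are 1-based; setting both to the same value
    # asks pdftoppm to render only that page.
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    out_path = Path(out_dir) / page_image_name(pdf_path, page_number)
    images[0].save(out_path, "JPEG")
    return out_path
```

Calling save_page_as_jpeg("upload.pdf", 2) would write upload-page2.jpg into the current directory, assuming Poppler is installed and on the PATH.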

Sew answered 2/2, 2018 at 12:51 Comment(18)
Hi, the poppler is just a zipped file, doesn't install anything, what is one supposed to do with the dll's or the bin files ?Treasure
@gaurwraith: Use the following link to poppler. For some reason the link in the description from Rodrigo is not the same as in the github repo.Jansson
@Keval Dave Have you installed poppler and tried pdf2image on Windows machine? Which Windows please?Expansible
@Expansible I have used this with windows 10 and 64bit machine. Find installation of poppler in windows from answer.Sew
This packages gives a white border to the image so removed it following this stackoverflow questionAbnormal
I've install it but got error: jpeg8.dll not foundHove
I've pretty easily run out of memory doing this - anyone know of a way to just convert a single page (without loading the whole thing, then just using [0] or something)?Sacellum
@Sacellum you can pass first_page and last_page as arguments to the convert_from_path function to convert only the specified pages.Sew
Thanks for the heads up on those arguments, however I still get the same issue (I believe it's with memory, the traceback isn't helpful). I'm wondering if first_page / last_page still requires loading the full PDF into memory and then internally just parses out the required pages.Sacellum
Is the '500' the dpi? Just wondering what your reason for going to 500 dpi would be, it looks like 300 is the standard.Elyseelysee
@Jacob 500 is the dpi. It is a tradeoff between the resolution required and the computation available. In my experiments, 500 worked well in most cases, while 300 got me low-res images.Sew
I used conda install -c conda-forge poppler to install poppler and it worked.Destinydestitute
For converting the first page of the PDF and nothing else, this works: from pdf2image import convert_from_path; pages = convert_from_path('file.pdf', 500, single_file=True); pages[0].save('file.jpg', 'JPEG')Orose
And there is a nice line in poppler docs: "You will then have to add the bin/ folder to PATH or use poppler_path = r"C:\path\to\poppler-xx\bin" as an argument in convert_from_path." thought in my case (conda install) it was actually C:\ProgramData\Anaconda3\pkgs\poppler-21.09.0-h24fffdf_1\Library\bin.Mudra
If using mac, you can install both packages needed using conda conda install poppler conda install pdf2imageMafaldamafeking
Get stuck on some pdfDaria
Poppler's license is GPL based. Be careful in the commercial setting!Unduly
This is probably the worst way if you are doing it for many PDFs. It stores the images as PPMs along with the JPEGs, and the PPMs are around 50 megabytes for each page of your PDF. It has a known issue with memory overhead.Buller
H
150

I found this simple solution using PyMuPDF, which outputs a PNG file. Note that the library is imported as "fitz", a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # number of page
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
doc.close()

Note: The library changed from using "camelCase" to "snake_cased". If you run into an error that a function does not exist, have a look under deprecated names. The functions in the example above have been updated accordingly.

The fitz.Document class supports a context manager initialization:

with fitz.open(pdffile) as doc:
   ...
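Since the question asks for JPEG, and get_pixmap() without arguments renders at 72 dpi, here is a hedged sketch that renders at a chosen resolution and saves through Pillow. The function names zoom_for_dpi and render_page_to_jpeg, the default of 150 dpi, and the JPEG quality of 90 are assumptions for illustration; Pillow is an extra dependency that the answer above does not need.

```python
def zoom_for_dpi(dpi):
    """PDF user space has 72 units per inch, so dpi / 72 is the zoom factor."""
    return dpi / 72


def render_page_to_jpeg(pdf_path, page_index, out_path, dpi=150):
    """Render one 0-based page with PyMuPDF and save it as a JPEG via Pillow."""
    # Imported lazily; both are third-party packages (PyMuPDF and Pillow).
    import fitz
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        # alpha=False keeps the pixmap in plain RGB, which JPEG requires.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        # Copy the raw RGB samples into a Pillow image before the document closes.
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    image.save(out_path, "JPEG", quality=90)
    return out_path
```

For example, render_page_to_jpeg("infile.pdf", 0, "outfile.jpg", dpi=300) would render the first page at roughly four times the default pixel dimensions in each direction.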
Hobbes answered 2/4, 2019 at 17:27 Comment(12)
Please add explanation to your answer.Yet
A good library and it installs on Windows 10 without problems (no wheels required). github.com/pymupdfInfluent
This is the BEST answer. This was the only code that didn't require an additional installation onto my OS. Python scripts should focus on working within the Python system. I did not need to install poppler, pdftoppm, imageMagick or ghostscript, etc. (Python 3.6)Kudos
Actually it requires another installation (fitz library, imported without even being referred to and its dependencies), this answer is incomplete (like all of the answers at this question)Vallation
@TommasoGuerrini no. From the docs: "The standard Python import statement for this library is import fitz. This has a historical reason..." <fitz> is another library, something about neuroimaging. The code works as expected.Hoard
@Hobbes Instead of pdf file taken from the path, can we take from pdfurl? Also, is it possible for the png file to be in-stream data rather than output-png file?Boss
image = page.get_pixmap(matrix=fitz.Matrix(150/72, 150/72)) extracts the image at 150 DPI. Issue question on this topic.Chrome
This solution uses code licensed commercially by Artifex Software, as well as open-source by AGPL licensing. Be wary of using this on your project, especially if it's commercial in nature. You may need to dig deeper into the legal implications.Zoster
The perfect solution: it needs no dependencies. No Poppler, nothing else.Finance
for jpeg, I used pil_save instead of saveTruett
You saved my life! [tears of joy], tried almost a thousand libraries (wand, svglib, cairosvg, pdf2image, pdf2files, etc.) Each one needed another program to run, download exe on Windows, sudo on Linux, add to path... But this one is magic!!! you can even use page.get_pixmap(dpi=300) to get a 5921×1734 PNG file!!! I'm in love with this 💖.Snowflake
Although code works, for some reason the extracted image is much lower quality than the original. I can't find anything obvious that would cause the quality degradation.Rifle
P
47

Using pypdfium2 (v4):

python3 -m pip install "pypdfium2==4" pillow
import pypdfium2 as pdfium

# Load a document
pdf = pdfium.PdfDocument("tests/resources/multipage.pdf")

# Loop over pages and render
for i in range(len(pdf)):
    page = pdf[i]
    image = page.render(scale=4).to_pil()
    image.save(f"output_{i:03d}.jpg")

Advantages:

  • PDFium is liberal-licensed (BSD 3-Clause, Apache 2.0)
  • It is fast, outperforming Poppler. In terms of speed, pypdfium2 can almost reach PyMuPDF
  • Returns PIL.Image.Image, numpy.ndarray, or a ctypes array, depending on your needs
  • Is capable of processing encrypted (password-protected) PDFs
  • No mandatory runtime dependencies
  • Supports Python >= 3.6
  • Setup infrastructure complies with PEP 517/518

Wheels are currently available for

  • Windows amd64, win32, arm64
  • macOS x86_64, arm64
  • Linux (glibc) x86_64, i686, aarch64, armv7l
  • Linux (musl) x86_64, i686, aarch64

There is a script to build from source, too.

(Disclaimer: I'm the author)

Pinna answered 4/12, 2021 at 18:23 Comment(18)
This is the solution that worked best for me since it didn't require any other installation on python 3.9.13 and windows 10. You should add how to import pdfium in your reply: import pypdfium2 as pdfiumAustralian
Added, thanks! I believe it initially was part of the post but might have got lost during an edit. (I updated this reply several times due to API changes.)Pinna
@FrancescoPettini AFAIK, pymupdf doesn't require any external dependencies, either. Technically, it's yet a bit better than pypdfium2, so if you don't mind the AGPL, you could give that one a try, too.Pinna
installing pymupdf via fitz required me to install frontend, which if I remember correctly required other packages tooAustralian
@FrancescoPettini The docs say that pymupdf doesn't have any mandatory external runtime dependencies if installing from the binary wheels.Pinna
Trying to use the multi-page render here nets me a "An attempt has been made to start a new process before the current process has finished its bootstrapping phase."Harbaugh
@Harbaugh It looks like you may be calling the function in a special context where it is not possible to set up a new process pool. Consider using the single page renderer, or file a more detailed bug report on GitHub.Pinna
This should be the accepted answer, thanks for your work. No need of any extra installation, pip install pypdfium2 is enough.Reniti
This works great, but when using pyinstaller to create an exe, when I run the exe, it can't find "pdfium", which is referring to pypdfium2 (I checked the line that threw the error). Any idea as to how to fix this?Unduly
@Unduly pypdfium2 contains a binary extension, and you need to configure pyinstaller to take that along. The pyinstaller docs provide information on how to do this. I never used pyinstaller myself but had a similar issue report once and the user was able to fix it somehow (github.com/pypdfium2-team/pypdfium2/issues/120).Pinna
@Pinna Yes. I actually figured this out a few hours after I posted this... --collect-all pypdfium2 as a cmd line option should work, but I settled for --add-data "C:\Program Files\Python39\Lib\site-packages\pypdfium2\pdfium.dll";. (The "." at the end is intentional).Unduly
@Unduly Great, thanks for letting me know! Actually, I'm intrigued why pyinstaller doesn't automatically include the binary. After all, the file wouldn't be in the package directory if it wasn't needed.Pinna
@Pinna That's what I was saying!!! The only reason I stumbled into the answer was because I went searching into Lib\site-packages for "pdfium" because I wasn't importing any libraries called pdfium. I thought that if it was a dependency then I'd see it in site-packages. Lo and behold it wasn't, so I thought I'd explore pypdfium2's folder... and what do you know... pdfium.dll. Soooo annoying.Unduly
@Unduly Sorry for the inconveniences. I'm wondering if there's anything I can do to improve the situation. pypdfium2's setup code is a bit non-standard because setuptools extensions don't work for external binaries, they're only meant for in-place compilation. That's why we currently have to camouflage binaries as package data. Maybe pyinstaller would work correctly if it was an official extension, but I feel like package data should be included in any case...Pinna
Oh, no - you're definitely right. That dll should've been included with pyinstaller - so I mean its not your fault. I can't think of a practical way on your end to alert users that are trying to include the library in pyinstaller (or of the likes) that they'd have to set that flag. I think the best you could get is to include it in the readme - but thats not worth IMO creating another branch in github. One more note. I noticed that when I open a pdf, and get a page with get_page() after pdf.close(), it doesn't close the page. So a move operation throws an error, because the page is in use.Unduly
@Unduly Adding a note to the readme is a good idea, I'll do that. Concerning the problem you mention, I'm afraid I don't quite understand yet. Once you have called pdf.close(), no resources associated to that document handle may be accessed anymore, including loaded pages. Objects need to be closed in reverse order compared to loading (i. e. first the page, then the pdf). I'm not sure if I understood your problem correctly, though. If this information isn't sufficient, could you file a bug report on GitHub to elaborate?Pinna
No, then I had false assumptions. I figured that when you close()d a PdfDocument() it'd kill its children, but maybe those instances aren't tied to the PdfDocument() at all? IDK, I remember trying to delete the document then the page, but I could've sworn that I did switched it and tried it the other way. But that's neither here nor there, since I got it working. Not that its a huge deal, but maybe in a future release, consider keeping the reference to the children and close() them when the parent pdf is close()d. Real quick, do you keep vector data on rendering a pdf or rasterize?Unduly
May I add a snippet --- you may need to install PILLOW for pypdfium2 to work properly as described in the code example above. I certainly had to.Huppert
C
30

The Python library pdf2image (used in the other answer) in fact doesn't do much more than launch pdftoppm with subprocess.Popen, so here is a short version doing it directly:

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.
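To render only one page as JPEG with pdftoppm, the call can be narrowed with its -f, -l, -r, -jpeg and -singlefile options. The sketch below passes the arguments as a list, which avoids manual quoting of paths that contain spaces. The function names build_pdftoppm_cmd and render_page and the default of 150 dpi are assumptions for illustration, and the sketch assumes pdftoppm is on the PATH unless a full path is given.

```python
import subprocess


def build_pdftoppm_cmd(pdf_path, out_prefix, page, dpi=150, pdftoppm="pdftoppm"):
    """Build the argument list that renders one 1-based page as a JPEG."""
    return [
        pdftoppm,
        "-jpeg",            # write JPEG instead of the default PPM
        "-r", str(dpi),     # resolution in DPI
        "-f", str(page),    # first page to convert
        "-l", str(page),    # last page to convert
        "-singlefile",      # write <out_prefix>.jpg without a page-number suffix
        str(pdf_path),
        str(out_prefix),
    ]


def render_page(pdf_path, out_prefix, page, dpi=150, pdftoppm="pdftoppm"):
    """Run pdftoppm and return the name of the file it is expected to write."""
    cmd = build_pdftoppm_cmd(pdf_path, out_prefix, page, dpi, pdftoppm)
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(cmd, check=True)
    return f"{out_prefix}.jpg"
```

Unlike subprocess.Popen, subprocess.run waits for pdftoppm to finish, so the output file should exist once render_page returns.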

Coligny answered 22/5, 2018 at 21:33 Comment(1)
Hi, the Windows installation link for pdftoppm is just a bunch of zipped files, what do you have to do with them to make them work? Thanks!Treasure
B
17

There is no need to install Poppler on your OS, although Wand does need ImageMagick (and Ghostscript for PDF input), as the comments below note. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with Image(filename=f, resolution=120) as source:
    for i, image in enumerate(source.sequence):
        newfilename = f.removesuffix(".pdf") + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)
Belting answered 6/2, 2019 at 1:15 Comment(4)
ImageMagick library needs to be installed to work on wand.Brackett
I tried this and needed to install Ghostscript as well (using Windows 10 and Python 3.7). Did it and it worked perfectly.Isidore
whats the f[:-4] for? its not referenced anywhere elsePescara
@Pescara f[:-4] will cut off ".pdf" from the filename (string slicing) to create a new filename with another extension.Sub
M
13

@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:

  1. Download zip file with Poppler's latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".

  2. Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.

  3. From cmd line install pdf2image module -> "pip install pdf2image".

  4. Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module as explained by user Basj.

@vishvAs vAsuki, this code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:

import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

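        # Quote the file name (it may contain spaces) and use the PDF's own name
        # as the output prefix so that different PDFs do not overwrite each other.
        subprocess.Popen('"%s" -jpeg "%s" "%s"' % (pdftoppm_path, pdf_file, pdf_file[:-4]))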

Or using the pdf2image module:

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        pages = convert_from_path(pdf_file, 300)
        pdf_name = pdf_file[:-4]

        # enumerate() gives the page index directly; pages.index(page) would rescan the list
        for page_index, page in enumerate(pages):

            page.save("%s-page%d.jpg" % (pdf_name, page_index), "JPEG")
Mirilla answered 24/11, 2018 at 22:46 Comment(2)
This helped a lot. Thanks!Cantaloupe
This should actually be the accepted answer. Shows what to do with the installed binaries for PopplerGensmer
S
8

GhostScript performs much faster than Poppler for a Linux based system.

Following is the code for pdf to image conversion.

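import subprocess

RESOLUTION = 300  # DPI; this constant was not defined in the original snippet, so adjust as needed

def get_image_page(pdf_file, out_file, page_num):
    # page_num is 0-based, Ghostscript page numbers are 1-based
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    # discard Ghostscript's output without leaving an open file handle behind
    subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)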

GhostScript can be installed on macOS using brew install ghostscript

If Ghostscript is not already installed on your system, installation information for other platforms can be found here.

Sew answered 7/1, 2020 at 12:29 Comment(2)
Just to let everyone know, Ghostscript is based on AGPL License and might need permissions in case used within commercial projects. For more reference, read ghostscript.com/license.html.Wouldbe
How do you get to the conclusion that Ghostscript is "much faster" than Poppler? I can't reproduce this observation in my personal benchmarks. In fact, I found Ghostscript to be slightly slower.Pinna
E
5

There is a utility called pdf2jpg which can be used to convert a PDF to images.

You can find the code here: https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)
Extempore answered 30/7, 2018 at 15:17 Comment(2)
did this java thing just delete my whole folder full of pdf manipulating python scripts....?Wiseacre
An alternative binding to Apache PDFBox is github.com/lebedov/python-pdfboxPinna
P
4

Installing Poppler is a problem everyone will face. My approach is a bit of a workaround, but it works well.

First, download Poppler here.

Then extract it and pass poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for example) to convert_from_path, as shown below:

from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image'+str(i)+'.png'
    image.save(fname, "PNG")
Painless answered 10/12, 2020 at 14:19 Comment(1)
This will produce an image per page with the i argument. It works really well. Thank you!Cleanly
R
4

Here is a function that does the conversion of a PDF file with one or multiple pages to a single merged JPEG image.

import os
import tempfile
from pdf2image import convert_from_path
from PIL import Image

def convert_pdf_to_image(file_path, output_path):
    # save temp image files in temp dir, delete them after we are finished
    with tempfile.TemporaryDirectory() as temp_dir:
        # convert pdf to multiple images
        images = convert_from_path(file_path, output_folder=temp_dir)
        # save images to temporary directory
        temp_images = []
        for i in range(len(images)):
            image_path = f'{temp_dir}/{i}.jpg'
            images[i].save(image_path, 'JPEG')
            temp_images.append(image_path)
        # read images into pillow.Image
        imgs = list(map(Image.open, temp_images))
        # Pillow opens files lazily, so merge while the temp files still exist
        # find minimum width of images
        min_img_width = min(i.width for i in imgs)
        # find total height of all images
        total_height = 0
        for i, img in enumerate(imgs):
            total_height += imgs[i].height
        # create new image object with width and total height
        merged_image = Image.new(imgs[0].mode, (min_img_width, total_height))
        # paste images together one by one
        y = 0
        for img in imgs:
            merged_image.paste(img, (0, y))
            y += img.height
        # save merged image
        merged_image.save(output_path)
        # release the file handles before the temp dir is removed
        for img in imgs + images:
            img.close()
    return output_path

Example usage: -

convert_pdf_to_image("path_to_Pdf/1.pdf", "output_path/output.jpeg")
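The function above sizes the merged canvas to the narrowest page, so wider pages are cropped on the right. A possible variation, sketched here under the assumption that padding is preferable to cropping, sizes the canvas to the widest page and fills the remainder with a background colour. The names stacked_canvas_size and stack_vertically are hypothetical and not part of pdf2image or Pillow.

```python
def stacked_canvas_size(sizes):
    """Given (width, height) pairs, return (max width, total height)."""
    widths, heights = zip(*sizes)
    return max(widths), sum(heights)


def stack_vertically(images, background="white"):
    """Paste PIL images top to bottom onto one canvas and return it."""
    # Imported lazily so the size helper can be used without Pillow.
    from PIL import Image

    width, height = stacked_canvas_size([im.size for im in images])
    canvas = Image.new("RGB", (width, height), background)
    y = 0
    for im in images:
        canvas.paste(im, (0, y))
        y += im.height
    return canvas
```

With this variation, stack_vertically(convert_from_path("1.pdf")).save("output.jpeg") would merge the pages in memory without writing temporary files.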

Russellrusset answered 17/1, 2021 at 8:17 Comment(1)
Just curious, why for i, img in enumerate(imgs): total_height += imgs[i].height instead of simply for img in imgs: total_height += img.height ?Schlesien
P
2

I wrote this script to easily convert a folder directory that contains PDFs (single page) to PNGs really nicely.

import os
from pathlib import PurePath
import glob
# from PIL import Image
from pdf2image import convert_from_path
import pdb

# In[file list]

wd = os.getcwd()

# filter images
fileListpdf = glob.glob(f'{wd}//*.pdf')

# In[Convert pdf to images]

for i in fileListpdf:
    
    images = convert_from_path(i, dpi=300)
    
    path_split = PurePath(i).parts
    fileName, ext = os.path.splitext(path_split[-1])
    
    images[0].save(f'{fileName}.png', 'PNG')

Hopefully, this helps if you need to convert PDFs to PNGs!

Pettigrew answered 18/5, 2021 at 18:47 Comment(1)
unrelated, fwiw, you can also do pathlib.Path.cwd()Saprophyte
G
1

I use a (maybe) much simpler option of pdf2image:

cd "$dir"
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n=$(echo "$f" | cut -f1 -d'.')
    pdftoppm -scale-to 1440 -png "$f" "$conv/$n"
    rm "$f"
    mv "$conv"/*.png "$dir"
  fi
done

This is a small part of a bash script in a loop for the use of a narrow casting device. Checks every 5 seconds on added pdf files (all) and processes them. This is for a demo device, at the end converting will be done at a remote server. Converting to .PNG now, but .JPG is possible too.

This conversion, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions), sets the Pi3 to almost 4x 100% cpu-load ;-)

Generality answered 30/7, 2019 at 6:48 Comment(1)
The question is about rendering a PDF with Python, not bash.Pinna
B
-1
from pdf2image import convert_from_path
import glob

pdf_dir = glob.glob(r'G:\personal\pdf\*')  #your pdf folder path
img_dir = "G:\\personal\\img\\"           #your dest img path

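for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    # include the page number, otherwise every page overwrites the same file
    for page_number, page in enumerate(pages, start=1):
        page.save(img_dir + pdf_.split("\\")[-1][:-4] + "_" + str(page_number) + ".jpg", 'JPEG')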
Biak answered 23/5, 2019 at 7:7 Comment(2)
This would be a better answer if you explained how the code you provided answers the question.Highpriced
@Highpriced Python is fairly readable, the comments do indicate the source folder and output folder, the rest reads like english.Pescara
M
-1

Here is a solution which requires no additional libraries and is very fast. Note that it extracts a JPEG image that is already embedded in the PDF rather than rendering a page. This was found at: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# I have wrapped the code in a function to make it more convenient.

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

Call convert with the pdf path as the argument and the function will create a .jpg file in the same directory

Magda answered 17/3, 2020 at 11:31 Comment(1)
This technique looks like it extracts images that have been embedded in the file, rather than rasterizing a page of the file as an image which is what the questioner wanted.Snowball
S
-1

For a pdf file with multiple pages, the following is the best & simplest (I used pdf2image-1.14.0):

from pdf2image import convert_from_path
from pdf2image.exceptions import (
     PDFInfoNotInstalledError,
     PDFPageCountError,
     PDFSyntaxError
     )
        
images = convert_from_path(r"path/to/input/pdf/file", output_folder=r"path/to/output/folder", fmt="jpg",) #dpi=200, grayscale=True, size=(300,400), first_page=0, last_page=3)
        
images.clear()

Note:

  1. "images" is a list of PIL images.
  2. The saved images in the output folder will have system generated names; one can later change them, if required.
Strongarm answered 15/3, 2021 at 17:11 Comment(2)
Why is this "the best" ?Feudatory
1) Fast as, no loop is required. 2) All the required parameters (like dpi, format, grayscale option, size etc.) are processed at one run. 3) Built-in exception handling is there. 4) The core function calling is only a single line statement. 5) You can get images as 'saved' files as well as a 'list' of 'matrices'.Strongarm
E
-1

This easy script can convert a folder directory that contains PDFs (single/multiple pages) to jpeg.

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
from os import listdir
from os import system
from os.path import isfile, join, basename, dirname
import shutil

def move_processed_file(file, doc_path, download_processed):
    try:
        shutil.move(doc_path + '/' + file, download_processed + '/' + file)
    except Exception as e:
        # not every exception has an errno attribute, so print the exception itself
        print(e)
        raise


def run_conversion():
    root_dir = os.path.abspath(os.curdir)

    doc_path = root_dir + r"\data\download"
    pdf_processed = root_dir + r"\data\download\pdf_processed"
    results_folder = doc_path

    files = [f for f in listdir(doc_path) if isfile(join(doc_path, f))]

    pdf_files = [f for f in listdir(doc_path) if isfile(join(doc_path, f)) and f.lower().endswith('.pdf')]

    # check OS type
    if os.name == 'nt':
        # if is windows or a graphical OS, change this poppler path with your own path
        poppler_path = r"C:\poppler-0.68.0\bin"
    else:
        # on Linux/macOS the Poppler binaries usually live in /usr/bin; adjust if yours differ
        poppler_path = "/usr/bin"

    for file in pdf_files:

        ''' 
        # Converting PDF to images 
        '''

        # Store all the pages of the PDF in a variable
        pages = convert_from_path(doc_path + '/' + file, 500, poppler_path=poppler_path)

        # Counter to store images of each page of PDF to image
        image_counter = 1

        filename, file_extension = os.path.splitext(file)

        # Iterate through all the pages stored above
        for page in pages:
            # Declaring filename for each page of PDF as JPG
            # PDF page n -> <name>_n.jpg
            # Use a separate variable so the base name is not extended on every page
            page_filename = filename + '_' + str(image_counter) + ".jpg"

            # Save the image of the page in system
            page.save(results_folder + '/' + page_filename, 'JPEG')

            # Increment the counter to update filename
            image_counter += 1

        move_processed_file(file, doc_path, pdf_processed)


Ecdysis answered 12/4, 2022 at 9:57 Comment(0)
P
-1

Following pdf2image documentation in 2024. Just remember to install poppler

convert_from_path returns a list with all the pages of the PDF converted to images (PIL Image objects). The code then defines the file name and saves the first page, image_list[0], as JPEG. If you want to save all pages, just iterate over image_list.

import os
from pdf2image import convert_from_path

pdf_folder = 'path/to/pdfs'
img_folder = 'path/to/save/imgs'

for file in os.listdir(pdf_folder):
    if file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_folder, file)

        # convert_from_path opens the file itself, so no explicit open() is needed
        name = os.path.splitext(file)[0]
        image_list = convert_from_path(pdf_path, poppler_path='C:/Poppler/bin')
        img_path = os.path.join(img_folder, f'{name}.jpg')
        image_list[0].save(img_path, 'JPEG')

print("Finished!")
Perot answered 12/3 at 17:56 Comment(0)
H
-3
from pdf2image import convert_from_path

PDF_file = 'Statement.pdf'
pages = convert_from_path(PDF_file, 500,userpw='XXX')

image_counter = 1

for page in pages:

    filename = "foldername/page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
Hall answered 14/4, 2021 at 5:36 Comment(1)
Posting a poorly formatted, incorrectly indented answer with no explanation as to how your answer works or what benefits it offers compared to the 13 existing answers, is of very little value as it stands. Please edit your answer, fix the formatting (the formatting help can assist you), fix the indentation, and add some explanation.Confucianism
