Extract a page from a pdf as a jpeg

Asked 12/9, 2017 at 19:44 Answered 12/3 at 17:56

192

In python code, how can I efficiently save a certain page of a PDF as a JPEG file?

Use case: I have a Python flask web server where PDFs will be uploaded and JPEGs corresponding to each page are stored.

This solution is close, but the problem is that it does not convert the entire page to JPEG.

Shanney answered 12/9, 2017 at 19:44 Comment(2)

Depending on the image, it may be better to extract as a png. This would apply if the page contains mainly text. – Rowland 18/6, 2020 at 5:54

Although generally true, the code using fitz that outputs PNG is substantially lower quality than the accepted one using JPG. I suspect the image resolutions are resized per PDF paper size. – Rifle 6/4, 2023 at 2:13

231

The pdf2image library can be used.

You can install it with pip:

pip install pdf2image

Once it is installed, you can use the following code to get the images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/, but 0.68 was released in Aug 2018, so you will not be getting the latest features or bug fixes.
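Since the question asks about one particular page, here is a minimal sketch that renders only that page by passing first_page and last_page (both 1-based) to convert_from_path. The helper names page_output_path and save_page_as_jpeg and the file-naming scheme are assumptions for illustration, not part of pdf2image, and poppler still has to be installed.

```python
from pathlib import Path


def page_output_path(pdf_path, page_number, out_dir="."):
    """Return e.g. out/report-page003.jpg for page 3 of report.pdf (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}-page{page_number:03d}.jpg"


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render a single 1-based page with pdf2image and save it as a JPEG."""
    # Imported lazily so the naming helper works without pdf2image/poppler installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(
        str(pdf_path), dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not pages:
        raise ValueError(f"page {page_number} not found in {pdf_path}")
    target = page_output_path(pdf_path, page_number, out_dir)
    # JPEG has no alpha channel, so make sure the image is plain RGB.
    pages[0].convert("RGB").save(target, "JPEG")
    return target
```

Something like save_page_as_jpeg("file.pdf", 3, out_dir="out") should then write only page 3, assuming the out folder exists.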

Sew answered 2/2, 2018 at 12:51 Comment(18)

Hi, the poppler is just a zipped file, doesn't install anything, what is one supposed to do with the dll's or the bin files ? – Treasure 26/8, 2018 at 21:59

@gaurwraith: Use the following link to poppler. For some reason the link in the description from Rodrigo is not the same as in the github repo. – Jansson 9/10, 2018 at 7:20

@Keval Dave Have you installed poppler and tried pdf2image on Windows machine? Which Windows please? – Expansible 27/11, 2018 at 15:8

@Expansible I have used this with windows 10 and 64bit machine. Find installation of poppler in windows from answer. – Sew 29/11, 2018 at 9:56

This packages gives a white border to the image so removed it following this stackoverflow question – Abnormal 6/5, 2019 at 14:0

I've install it but got error: jpeg8.dll not found – Hove 29/5, 2019 at 11:20

I've pretty easily run out of memory doing this - anyone know of a way to just convert a single page (without loading the whole thing, then just using [0] or something)? – Sacellum 4/6, 2019 at 23:16

@Sacellum you can pass first_page and last_page as arguments to the convert_from_path function to convert only the specified pages – Sew 5/6, 2019 at 9:57

Thanks for the heads up on those arguments, however I still get the same issue (I believe it's with memory, the traceback isn't helpful). I'm wondering if first_page / last_page still requires loading the full PDF into memory and then internally just parses out the required pages. – Sacellum 5/6, 2019 at 10:22

Is the '500' the dpi? Just wondering what your reason for going to 500 dpi would be, it looks like 300 is the standard. – Elyseelysee 25/7, 2019 at 1:9

@Jacob 500 is the dpi. It is a tradeoff between the resolution required and the computation available. In my experiments, 500 worked well in most cases, while 300 got me low-res images. – Sew 25/7, 2019 at 8:41

I used conda install -c conda-forge poppler to install poppler and it worked. – Destinydestitute 18/9, 2019 at 8:43

For converting the first page of the PDF and nothing else, this works:

from pdf2image import convert_from_path
pages = convert_from_path('file.pdf', 500, single_file=True)
pages[0].save('file.jpg', 'JPEG')

– Orose 12/11, 2019 at 9:37

And there is a nice line in poppler docs: "You will then have to add the bin/ folder to PATH or use poppler_path = r"C:\path\to\poppler-xx\bin" as an argument in convert_from_path." Though in my case (conda install) it was actually C:\ProgramData\Anaconda3\pkgs\poppler-21.09.0-h24fffdf_1\Library\bin. – Mudra 31/10, 2021 at 21:21

If using a Mac, you can install both packages needed using conda: conda install poppler, then conda install pdf2image – Mafaldamafeking 6/11, 2021 at 22:30

Get stuck on some pdf – Daria 11/4, 2022 at 9:19

Poppler's license is GPL based. Be careful in the commercial setting! – Unduly 10/9, 2022 at 18:46

This is probably the worst way if you are doing it for many pdfs. It stores images in ppms alongwith jpeg which itself are around 50 megabytes for each page of your pdf. It has a known issue of memory overhaul. – Buller 24/7, 2023 at 20:22

150

I found this simple solution, PyMuPDF, output to png file. Note the library is imported as "fitz", a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # number of page
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
doc.close()

Note: The library changed from using "camelCase" to "snake_cased". If you run into an error that a function does not exist, have a look under deprecated names. The functions in the example above have been updated accordingly.

The fitz.Document class supports a context manager initialization:

with fitz.open(pdffile) as doc:
   ...
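Several comments mention low output quality. With no arguments, get_pixmap() renders at 72 dpi (one pixel per PDF point), so a zoom matrix is needed for a sharper image. Below is a rough sketch of the arithmetic and of a JPEG export through Pillow. The function names are made up for illustration, and the Pillow route is just one way to get a JPEG; it assumes both PyMuPDF and Pillow are installed.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page given its size in PDF points (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def render_page_to_jpeg(pdf_path, page_index, out_path, dpi=200):
    """Render one 0-based page with PyMuPDF and write it as a JPEG via Pillow."""
    # Imported lazily so pixel_size() can be used without the libraries installed.
    import fitz  # PyMuPDF
    from PIL import Image

    zoom = dpi / 72  # scale factor relative to the 72 dpi default
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        # Copy the raw RGB samples into a Pillow image before the document closes.
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    image.save(out_path, "JPEG", quality=90)
    return out_path
```

For a US Letter page (612 x 792 points), 300 dpi gives 2550 x 3300 pixels, which is why the default 72 dpi output looks coarse next to it.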

Hobbes answered 2/4, 2019 at 17:27 Comment(12)

Please add explanation to your answer. – Yet 2/4, 2019 at 17:31

A good library and it installs on Windows 10 without problems (no wheels required). github.com/pymupdf – Influent 23/1, 2020 at 9:27

This is the BEST answer. This was the only code that didn't require an additional installation onto my OS. Python scripts should focus on working within the Python system. I did not need to install poppler, pdftoppm, imageMagick or ghostscript, etc. (Python 3.6) – Kudos 4/2, 2020 at 22:11

Actually it requires another installation (fitz library, imported without even being referred to and its dependencies), this answer is incomplete (like all of the answers at this question) – Vallation 6/2, 2020 at 12:36

@TommasoGuerrini no. From the docs: "The standard Python import statement for this library is import fitz. This has a historical reason..." <fitz> is another library, something about neuroimaging. The code works as expected. – Hoard 18/2, 2020 at 8:49

@Hobbes Instead of pdf file taken from the path, can we take from pdfurl? Also, is it possible for the png file to be in-stream data rather than output-png file? – Boss 4/3, 2020 at 6:23

image = page.getPixmap(matrix=fitz.Matrix(150/72,150/72)) extracts the image at 150 DPI. Issue question on this topic. – Chrome 20/7, 2020 at 21:21

This solution uses code licensed commercially by Artifex Software, as well as open-source by AGPL licensing. Be wary of using this on your project, especially if it's commercial in nature. You may need to dig deeper into the legal implications. – Zoster 7/3, 2021 at 18:44

The perfect solution: it needs no dependencies. No poppler, nothing else. – Finance 2/4, 2022 at 12:16

for jpeg, I used pil_save instead of save – Truett 9/1, 2023 at 19:1

You saved my life! [tears of joy], tried almost a thousand libraries (wand, svglib, cairosvg, pdf2image, pdf2files, etc.) Each one needed another program to run, download exe on Windows, sudo on Linux, add to path... But this one is magic!!! you can even use page.get_pixmap(dpi=300) to get a 5921×1734 PNG file!!! I'm in love with this 💖. – Snowflake 7/3, 2023 at 7:34

Although code works, for some reason the extracted image is much lower quality than the original. I can't find anything obvious that would cause the quality degradation. – Rifle 6/4, 2023 at 2:11

Using pypdfium2 (v4):

python3 -m pip install "pypdfium2==4" pillow

import pypdfium2 as pdfium

# Load a document
pdf = pdfium.PdfDocument("tests/resources/multipage.pdf")

# Loop over pages and render
for i in range(len(pdf)):
    page = pdf[i]
    image = page.render(scale=4).to_pil()
    image.save(f"output_{i:03d}.jpg")

Advantages:

PDFium is liberal-licensed (BSD 3-Clause, Apache 2.0)
It is fast, outperforming Poppler. In terms of speed, pypdfium2 can almost reach PyMuPDF
Returns PIL.Image.Image, numpy.ndarray, or a ctypes array, depending on your needs
Is capable of processing encrypted (password-protected) PDFs
No mandatory runtime dependencies
Supports Python >= 3.6
Setup infrastructure complies with PEP 517/518

Wheels are currently available for

Windows amd64, win32, arm64
macOS x86_64, arm64
Linux (glibc) x86_64, i686, aarch64, armv7l
Linux (musl) x86_64, i686, aarch64

There is a script to build from source, too.

(Disclaimer: I'm the author)
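For a single page at a given resolution: the scale argument is relative to 72 dpi, so scale=4 in the loop above corresponds to 288 dpi. Here is a small sketch against the v4 API used in this answer; the helper names are hypothetical. It also closes the page before the document, which is the order discussed in the comments.

```python
def scale_for_dpi(dpi):
    """Convert a DPI value into a pypdfium2 scale factor (1 PDF unit = 1/72 inch)."""
    return dpi / 72


def dpi_for_scale(scale):
    """Inverse of scale_for_dpi()."""
    return scale * 72


def render_page_to_jpeg(pdf_path, page_index, out_path, dpi=200):
    """Render one 0-based page with pypdfium2 (v4) and save it as a JPEG."""
    # Imported lazily so the two helpers above work without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        page = pdf[page_index]
        try:
            image = page.render(scale=scale_for_dpi(dpi)).to_pil()
            # Save while the page is still open; JPEG needs plain RGB.
            image.convert("RGB").save(out_path, "JPEG")
        finally:
            page.close()  # close the page first ...
    finally:
        pdf.close()  # ... then the document
    return out_path
```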

Pinna answered 4/12, 2021 at 18:23 Comment(18)

This is the solution that worked best for me since it didn't require any other installation on python 3.9.13 and windows 10. You should add how to import pdfium in your reply: import pypdfium2 as pdfium – Australian 25/7, 2022 at 9:49

Added, thanks! I believe it initially was part of the post but might have got lost during an edit. (I updated this reply several times due to API changes.) – Pinna 25/7, 2022 at 10:46

@FrancescoPettini AFAIK, pymupdf doesn't require any external dependencies, either. Technically, it's yet a bit better than pypdfium2, so if you don't mind the AGPL, you could give that one a try, too. – Pinna 25/7, 2022 at 11:1

installing pymupdf via fitz required me to install frontend, which if I remember correctly required other packages too – Australian 25/7, 2022 at 11:6

@FrancescoPettini The docs say that pymupdf doesn't have any mandatory external runtime dependencies if installing from the binary wheels. – Pinna 27/7, 2022 at 12:4

Trying to use the multi-page render here nets me a "An attempt has been made to start a new process before the current process has finished its bootstrapping phase." – Harbaugh 4/8, 2022 at 23:52

@Harbaugh It looks like you may be calling the function in a special context where it is not possible to set up a new process pool. Consider using the single page renderer, or file a more detailed bug report on GitHub. – Pinna 16/8, 2022 at 18:7

This should be the accepted answer, thanks for your work. No need of any extra installation, pip install pypdfium2 is enough. – Reniti 25/8, 2022 at 13:9

This works great, but when using pyinstaller to create an exe, when I run the exe, it can't find "pdfium", which is referring to pypdfium2 (I checked the line that threw the error). Any idea as to how to fix this? – Unduly 9/9, 2022 at 19:25

@Unduly pypdfium2 contains a binary extension, and you need to configure pyinstaller to take that along. The pyinstaller docs provide information on how to do this. I never used pyinstaller myself but had a similar issue report once and the user was able to fix it somehow (github.com/pypdfium2-team/pypdfium2/issues/120). – Pinna 10/9, 2022 at 13:21

@Pinna Yes. I actually figured this out a few hours after I posted this... --collect-all pypdfium2 as a cmd line option should work, but I settled for --add-data "C:\Program Files\Python39\Lib\site-packages\pypdfium2\pdfium.dll";. (The "." at the end is intentional). – Unduly 10/9, 2022 at 17:23

@Unduly Great, thanks for letting me know! Actually, I'm intrigued why pyinstaller doesn't automatically include the binary. After all, the file wouldn't be in the package directory if it wasn't needed. – Pinna 10/9, 2022 at 18:33

@Pinna That's what I was saying!!! The only reason I stumbled into the answer was because I went searching into Lib\site-packages for "pdfium" because I wasn't importing any libraries called pdfium. I thought that if it was a dependency then I'd see it in site-packages. Lo and behold it wasn't, so I thought I'd explore pypdfium2's folder... and what do you know... pdfium.dll. Soooo annoying. – Unduly 10/9, 2022 at 18:42

@Unduly Sorry for the inconveniences. I'm wondering if there's anything I can do to improve the situation. pypdfium2's setup code is a bit non-standard because setuptools extensions don't work for external binaries, they're only meant for in-place compilation. That's why we currently have to camouflage binaries as package data. Maybe pyinstaller would work correctly if it was an official extension, but I feel like package data should be included in any case... – Pinna 10/9, 2022 at 19:11

Oh, no - you're definitely right. That dll should've been included with pyinstaller - so I mean its not your fault. I can't think of a practical way on your end to alert users that are trying to include the library in pyinstaller (or of the likes) that they'd have to set that flag. I think the best you could get is to include it in the readme - but thats not worth IMO creating another branch in github. One more note. I noticed that when I open a pdf, and get a page with get_page() after pdf.close(), it doesn't close the page. So a move operation throws an error, because the page is in use. – Unduly 10/9, 2022 at 19:19

@Unduly Adding a note to the readme is a good idea, I'll do that. Concerning the problem you mention, I'm afraid I don't quite understand yet. Once you have called pdf.close(), no resources associated to that document handle may be accessed anymore, including loaded pages. Objects need to be closed in reverse order compared to loading (i. e. first the page, then the pdf). I'm not sure if I understood your problem correctly, though. If this information isn't sufficient, could you file a bug report on GitHub to elaborate? – Pinna 10/9, 2022 at 21:23

No, then I had false assumptions. I figured that when you close()d a PdfDocument() it'd kill its children, but maybe those instances aren't tied to the PdfDocument() at all? IDK, I remember trying to delete the document then the page, but I could've sworn that I did switched it and tried it the other way. But that's neither here nor there, since I got it working. Not that its a huge deal, but maybe in a future release, consider keeping the reference to the children and close() them when the parent pdf is close()d. Real quick, do you keep vector data on rendering a pdf or rasterize? – Unduly 11/9, 2022 at 3:51

May I add a snippet --- you may need to install PILLOW for pypdfium2 to work properly as described in the code example above. I certainly had to. – Huppert 7/8, 2023 at 4:41

The Python library pdf2image (used in the other answer) in fact doesn't do much more than launch pdftoppm with subprocess.Popen, so here is a short version that does it directly:

import subprocess

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

# Passing the arguments as a list avoids manual quoting of paths with spaces
subprocess.Popen([PDFTOPPMPATH, "-png", PDFFILE, "out"])

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.
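To get just one page as a JPEG this way, pdftoppm accepts -f and -l for the first and last page, -r for the resolution, and -singlefile to drop the page-number suffix from the output name. A minimal sketch follows; the function names are mine, and it assumes pdftoppm is on PATH unless exe is given.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, page, dpi=200, exe="pdftoppm"):
    """Build the argument list that renders one 1-based page to <out_prefix>.jpg."""
    return [
        exe,
        "-jpeg",              # write JPEG instead of the default PPM
        "-r", str(dpi),       # resolution in dpi
        "-f", str(page),      # first page to render
        "-l", str(page),      # last page to render
        "-singlefile",        # no page-number suffix on the output name
        str(pdf_path),
        str(out_prefix),
    ]


def render_page(pdf_path, out_prefix, page, dpi=200, exe="pdftoppm"):
    """Run pdftoppm; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, page, dpi, exe), check=True)
    return f"{out_prefix}.jpg"
```

Unlike a bare Popen call, subprocess.run with check=True waits for the conversion to finish and reports failures.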

Coligny answered 22/5, 2018 at 21:33 Comment(1)

Hi, the Windows installation link for pdftoppm is just a bunch of zipped files, what do you have to do with them to make them work ? Thanks! – Treasure 27/8, 2018 at 11:5

There is no need to install Poppler on your OS. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with Image(filename=f, resolution=120) as source:
    for i, image in enumerate(source.sequence):
        newfilename = f.removesuffix(".pdf") + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)

Belting answered 6/2, 2019 at 1:15 Comment(4)

ImageMagick library needs to be installed to work on wand. – Brackett 13/3, 2019 at 12:32

I tried this and needed to install Ghostscript as well (using Windows 10 and Python 3.7). Did it and it worked perfectly. – Isidore 1/7, 2019 at 7:55

whats the f[:-4] for? its not referenced anywhere else – Pescara 14/9, 2019 at 23:27

@Pescara f[:-4] will cut of ".pdf" from filename ( string slicing ) to create new filename with other ext. – Sub 1/11, 2019 at 19:10

@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:

Download zip file with Poppler's latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".
Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.
From cmd line install pdf2image module -> "pip install pdf2image".
Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module as explained by user Basj.

@vishvAs vAsuki, this code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:

import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

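for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        # Use the PDF's own name as the output prefix so that pages from
        # different PDFs don't overwrite each other
        prefix = os.path.splitext(pdf_file)[0]
        # A list of arguments also handles file names that contain spaces
        subprocess.run([pdftoppm_path, "-jpeg", pdf_file, prefix], check=True)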

Or using the pdf2image module:

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

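for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        base_name = pdf_file[:-4]
        # enumerate() numbers the pages; pages.index(page) would rescan the list for every page
        for page_number, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (base_name, page_number), "JPEG")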
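Several comments report running out of memory on large PDFs, because convert_from_path holds every rendered page at once. One way around that, sketched below, is to read the page count with pdfinfo_from_path and convert a few pages at a time. The batch size, helper names and output naming are assumptions for illustration.

```python
from pathlib import Path


def page_batches(total_pages, batch_size):
    """Yield 1-based (first_page, last_page) ranges that cover total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, out_dir, dpi=200, batch_size=10):
    """Convert a PDF to JPEGs a few pages at a time to limit peak memory use."""
    # Imported lazily so page_batches() works without pdf2image/poppler installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total = pdfinfo_from_path(pdf_path)["Pages"]
    stem = Path(pdf_path).stem
    written = []
    for first, last in page_batches(total, batch_size):
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for number, page in enumerate(pages, start=first):
            target = Path(out_dir) / f"{stem}-page{number:03d}.jpg"
            page.save(target, "JPEG")
            written.append(target)
    return written
```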

Mirilla answered 24/11, 2018 at 22:46 Comment(2)

This helped a lot. Thanks! – Cantaloupe 22/10, 2019 at 10:23

This should actually be the accepted answer. Shows what to do with the installed binaries for Poppler – Gensmer 14/12, 2019 at 6:43

GhostScript performs much faster than Poppler for a Linux based system.

Following is the code for pdf to image conversion.

import subprocess

RESOLUTION = 300  # dpi; assumed value, the original answer did not define it

def get_image_page(pdf_file, out_file, page_num):
    # page_num is 0-based; Ghostscript counts pages from 1
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    # Discard Ghostscript's console output
    subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

GhostScript can be installed on macOS using brew install ghostscript

If it is not already installed on your system, installation information for other platforms can be found here.
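The function above writes PNG. Since the question asks for JPEG, here is a sketch of the same idea using Ghostscript's jpeg device with -dJPEGQ for the quality. The function name and default values are assumptions; on Windows the executable is typically called gswin64c rather than gs.

```python
import subprocess


def gs_jpeg_command(pdf_path, out_file, page, dpi=200, quality=90, exe="gs"):
    """Build a Ghostscript command that renders one 1-based page as a JPEG."""
    return [
        exe, "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=jpeg",            # JPEG output device
        f"-dJPEGQ={quality}",       # JPEG quality, 0-100
        f"-r{dpi}",                 # resolution in dpi
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={out_file}",
        str(pdf_path),
    ]


def render_page(pdf_path, out_file, page, **options):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(gs_jpeg_command(pdf_path, out_file, page, **options), check=True)
    return out_file
```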

Sew answered 7/1, 2020 at 12:29 Comment(2)

Just to let everyone know, Ghostscript is based on AGPL License and might need permissions in case used within commercial projects. For more reference, read ghostscript.com/license.html. – Wouldbe 6/7, 2021 at 18:27

How do you get to the conclusion that Ghostscript is "much faster" than Poppler? I can't reproduce this observation in my personal benchmarks. In fact, I found Ghostscript to be slightly slower. – Pinna 14/4, 2022 at 14:19

There is a utility called pdf2jpg which can be used to convert a PDF to images.

You can find the code here: https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)

Extempore answered 30/7, 2018 at 15:17 Comment(2)

did this java thing just delete my whole folder full of pdf manipulating python scripts....? – Wiseacre 26/11, 2018 at 13:40

An alternative binding to Apache PDFBox is github.com/lebedov/python-pdfbox – Pinna 14/4, 2022 at 14:32

One problem everyone will face is installing Poppler. My way is a bit of a trick, but it works reliably.

First, download Poppler here.

Then extract it and pass poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for example) to convert_from_path, like below:

from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image'+str(i)+'.png'
    image.save(fname, "PNG")

Painless answered 10/12, 2020 at 14:19 Comment(1)

This will produce an image per page with the i argument. It works really well. Thank you! – Cleanly 8/1, 2021 at 15:45

Here is a function that does the conversion of a PDF file with one or multiple pages to a single merged JPEG image.

from pdf2image import convert_from_path
from PIL import Image

def convert_pdf_to_image(file_path, output_path):
    # convert pdf to one PIL image per page; they are already in memory,
    # so no temporary files are needed
    imgs = convert_from_path(file_path)
    # find minimum width of images
    min_img_width = min(img.width for img in imgs)
    # find total height of all images
    total_height = sum(img.height for img in imgs)
    # create new image object with width and total height
    merged_image = Image.new(imgs[0].mode, (min_img_width, total_height))
    # paste images together one by one
    y = 0
    for img in imgs:
        merged_image.paste(img, (0, y))
        y += img.height
    # save merged image
    merged_image.save(output_path)
    return output_path

Example usage: -

convert_pdf_to_image("path_to_Pdf/1.pdf", "output_path/output.jpeg")
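The stacking arithmetic can be looked at separately from any imaging library. The sketch below computes the canvas size and the vertical offset of each page from a list of (width, height) pairs. Note that it takes the widest page for the canvas so that nothing is cropped, whereas the function in this answer uses the narrowest; stack_layout is a hypothetical name.

```python
def stack_layout(sizes):
    """Given [(width, height), ...], return ((canvas_w, canvas_h), [y_offset, ...])."""
    if not sizes:
        raise ValueError("need at least one page")
    canvas_width = max(width for width, _ in sizes)
    offsets = []
    y = 0
    for _, height in sizes:
        offsets.append(y)  # this page starts where the previous one ended
        y += height
    return (canvas_width, y), offsets


canvas, offsets = stack_layout([(100, 50), (80, 70), (120, 30)])
print(canvas, offsets)
```

Each offset is the y coordinate that would be passed to paste() for the corresponding page.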

Russellrusset answered 17/1, 2021 at 8:17 Comment(1)

Just curious, why for i, img in enumerate(imgs): total_height += imgs[i].height instead of simply for img in imgs: total_height += img.height ? – Schlesien 5/7, 2021 at 9:55

I wrote this script to easily convert a folder directory that contains PDFs (single page) to PNGs really nicely.

import os
from pathlib import PurePath
import glob
from pdf2image import convert_from_path

# In[file list]

wd = os.getcwd()

# filter images
fileListpdf = glob.glob(f'{wd}//*.pdf')

# In[Convert pdf to images]

for i in fileListpdf:
    
    images = convert_from_path(i, dpi=300)
    
    path_split = PurePath(i).parts
    fileName, ext = os.path.splitext(path_split[-1])
    
    images[0].save(f'{fileName}.png', 'PNG')

Hopefully, this helps if you need to convert PDFs to PNGs!

Pettigrew answered 18/5, 2021 at 18:47 Comment(1)

unrelated, fwiw, you can also do pathlib.Path.cwd() – Saprophyte 19/6, 2023 at 1:53

I use a (maybe) much simpler option than pdf2image, calling pdftoppm directly from bash:

cd "$dir"
for f in *.pdf
do
  if [ -f "$f" ]; then
    # strip the .pdf extension; quoting keeps file names with spaces intact
    n="${f%.pdf}"
    pdftoppm -scale-to 1440 -png "$f" "$conv/$n"
    rm "$f"
    mv "$conv"/*.png "$dir"
  fi
done

This is a small part of a bash script that runs in a loop on a narrowcasting device. It checks every 5 seconds for newly added PDF files and processes all of them. This is for a demo device; in the end, the conversion will be done on a remote server. It converts to PNG now, but JPG is possible too.

This conversion, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions), pushes the Pi3 to almost 4x 100% CPU load ;-)

Generality answered 30/7, 2019 at 6:48 Comment(1)

The question is about rendering a PDF with Python, not bash. – Pinna 5/12, 2021 at 10:25

-1

from pdf2image import convert_from_path
import glob

pdf_dir = glob.glob(r'G:\personal\pdf\*')  #your pdf folder path
img_dir = "G:\\personal\\img\\"           #your dest img path

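for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    # file name without the ".pdf" extension
    base_name = pdf_.split("\\")[-1][:-4]
    # number the output files so that later pages don't overwrite earlier ones
    for number, page in enumerate(pages, start=1):
        page.save(img_dir + base_name + "_" + str(number) + ".jpg", 'JPEG')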

Biak answered 23/5, 2019 at 7:7 Comment(2)

This would be a better answer if you explained how the code you provided answers the question. – Highpriced 15/9, 2019 at 0:39

@Highpriced Python is fairly readable, the comments do indicate the source folder and output folder, the rest reads like english. – Pescara 15/9, 2019 at 10:36

-1

Here is a solution which requires no additional libraries and is very fast. This was found from: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# I have added the code in a function to make it more convenient.

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

Call convert with the pdf path as the argument and the function will create a .jpg file in the same directory
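To make the marker logic easier to follow, here is a stripped-down sketch of the same idea: scan a bytes object for the JPEG start-of-image marker (FF D8) and the next end-of-image marker (FF D9). It is deliberately naive, since it ignores the PDF stream structure and can be fooled by those byte pairs appearing inside other data. Like the function above, it only recovers JPEGs that are already embedded in the file; it does not render a page.

```python
def find_jpeg_spans(data):
    """Return every byte run that starts with the SOI marker and ends with the next EOI."""
    start_marker = b"\xff\xd8"  # JPEG start of image
    end_marker = b"\xff\xd9"    # JPEG end of image
    spans = []
    pos = 0
    while True:
        start = data.find(start_marker, pos)
        if start < 0:
            break
        end = data.find(end_marker, start + len(start_marker))
        if end < 0:
            break
        end += len(end_marker)  # include the end marker itself
        spans.append(data[start:end])
        pos = end
    return spans
```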

Magda answered 17/3, 2020 at 11:31 Comment(1)

This technique looks like it extracts images that have been embedded in the file, rather than rasterizing a page of the file as an image which is what the questioner wanted. – Snowball 20/3, 2020 at 16:43

-1

For a pdf file with multiple pages, the following is the best & simplest (I used pdf2image-1.14.0):

from pdf2image import convert_from_path
from pdf2image.exceptions import (
     PDFInfoNotInstalledError,
     PDFPageCountError,
     PDFSyntaxError
     )
        
images = convert_from_path(r"path/to/input/pdf/file", output_folder=r"path/to/output/folder", fmt="jpg",) #dpi=200, grayscale=True, size=(300,400), first_page=0, last_page=3)
        
images.clear()

Note:

"images" is a list of PIL images.
The saved images in the output folder will have system generated names; one can later change them, if required.

Strongarm answered 15/3, 2021 at 17:11 Comment(2)

Why is this "the best" ? – Feudatory 25/3, 2021 at 18:41

1) Fast as, no loop is required. 2) All the required parameters (like dpi, format, grayscale option, size etc.) are processed at one run. 3) Built-in exception handling is there. 4) The core function calling is only a single line statement. 5) You can get images as 'saved' files as well as a 'list' of 'matrices'. – Strongarm 26/3, 2021 at 12:34

-1

This easy script can convert a folder directory that contains PDFs (single/multiple pages) to jpeg.

from pdf2image import convert_from_path
import os
from os import listdir
from os.path import isfile, join
import shutil

def move_processed_file(file, doc_path, download_processed):
    try:
        shutil.move(doc_path + '/' + file, download_processed + '/' + file)
    except OSError as e:
        # shutil.move reports failures as OSError (or a subclass of it)
        print(e.errno)
        raise


def run_conversion():
    root_dir = os.path.abspath(os.curdir)

    doc_path = root_dir + r"\data\download"
    pdf_processed = root_dir + r"\data\download\pdf_processed"
    results_folder = doc_path

    files = [f for f in listdir(doc_path) if isfile(join(doc_path, f))]

    pdf_files = [f for f in listdir(doc_path) if isfile(join(doc_path, f)) and f.lower().endswith('.pdf')]

    # check OS type
    if os.name == 'nt':
        # on Windows, change this poppler path to your own path
        poppler_path = r"C:\poppler-0.68.0\bin"
    else:
        # on Linux/macOS pdftoppm is normally found on PATH
        poppler_path = None

    for file in pdf_files:

        ''' 
        # Converting PDF to images 
        '''

        # Store all the pages of the PDF in a variable
        pages = convert_from_path(doc_path + '/' + file, 500, poppler_path=poppler_path)

        # Counter to store images of each page of PDF to image
        image_counter = 1

        filename, file_extension = os.path.splitext(file)

        # Iterate through all the pages stored above
        for page in pages:
            # Declaring filename for each page of PDF as JPG
            # PDF page n -> <name>_n.jpg
            # (a separate variable keeps the suffixes from piling up on later pages)
            page_filename = filename + '_' + str(image_counter) + ".jpg"

            # Save the image of the page in system
            page.save(results_folder + '/' + page_filename, 'JPEG')

            # Increment the counter to update filename
            image_counter += 1

        move_processed_file(file, doc_path, pdf_processed)

Ecdysis answered 12/4, 2022 at 9:57 Comment(0)

-1

This follows the pdf2image documentation as of 2024. Just remember to install Poppler.

convert_from_path returns a list with one image per page of the PDF (PPM is the default format). The code below builds the file name and saves the first page, image_list[0], as a JPEG. If you want to save all pages, just iterate over image_list.

import os
from pdf2image import convert_from_path

pdf_folder = 'path/to/pdfs'
img_folder = 'path/to/save/imgs'

for file in os.listdir(pdf_folder):
    if file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_folder, file)

        name = os.path.splitext(file)[0]
        # convert_from_path opens the file itself, so no open() call is needed
        image_list = convert_from_path(pdf_path, poppler_path='C:/Poppler/bin')
        img_path = os.path.join(img_folder, f'{name}.jpg')
        image_list[0].save(img_path, 'JPEG')

print("Finished!")

Perot answered 12/3 at 17:56 Comment(0)

-3

from pdf2image import convert_from_path

PDF_file = 'Statement.pdf'
pages = convert_from_path(PDF_file, 500,userpw='XXX')

image_counter = 1

for page in pages:

    filename = "foldername/page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

Hall answered 14/4, 2021 at 5:36 Comment(1)

Posting a poorly formatted, incorrectly indented answer with no explanation as to how your answer works or what benefits it offers compared to the 13 existing answers, is of very little value as it stands. Please edit your answer, fix the formatting (the formatting help can assist you), fix the indentation, and add some explanation. – Confucianism 14/4, 2021 at 6:15
