How to solve MemoryError using Python 3.7 pdf2image library?

I'm running a simple PDF-to-image conversion using the Python pdf2image library. I understand that some memory limit is being exceeded to produce this error, but the PDF is only about 6.6 MB, so why would converting it take up gigabytes of memory?

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
    buffer.append(fh.read())
MemoryError

Also, what is the possible solution to this?

Update: When I reduced the dpi parameter in the convert_from_path function, it worked like a charm, but the resulting images are low quality (for obvious reasons). Is there a way to fix this, like creating the images batch by batch and clearing memory every time? If there is, how do I go about it?

Labdanum answered 6/6, 2019 at 6:8 Comment(2)
Do you have to use Python, or can you also use ImageMagick? – Fallfish
I want to do it through code, and Python is a very handy programming language. – Labdanum

Convert the PDF in chunks of 10 pages at a time (1-10, 11-20... and so on)

from pdf2image import pdfinfo_from_path, convert_from_path
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)

maxPages = info["Pages"]
for page in range(1, maxPages + 1, 10):
    convert_from_path(pdf_file, dpi=200, first_page=page, last_page=min(page + 10 - 1, maxPages))
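The loop above discards each chunk's images as soon as they are created; if the pages need to be kept, each chunk can be written to disk before the next one is converted, so only one chunk is in memory at a time. A sketch under that assumption (the page_chunks helper and the file-naming scheme are illustrative, not part of pdf2image):

```python
def page_chunks(max_pages, chunk=10):
    # Yield inclusive (first_page, last_page) pairs covering pages 1..max_pages
    for first in range(1, max_pages + 1, chunk):
        yield first, min(first + chunk - 1, max_pages)

def convert_in_chunks(pdf_file, out_dir, dpi=200, chunk=10):
    # Convert `chunk` pages at a time and save them to disk so each
    # batch of PIL images can be freed before the next one is created
    import os
    from pdf2image import pdfinfo_from_path, convert_from_path
    os.makedirs(out_dir, exist_ok=True)
    max_pages = pdfinfo_from_path(pdf_file)["Pages"]
    for first, last in page_chunks(max_pages, chunk):
        images = convert_from_path(pdf_file, dpi=dpi,
                                   first_page=first, last_page=last)
        for n, image in enumerate(images, start=first):
            image.save(os.path.join(out_dir, f"page_{n:04d}.jpg"), "JPEG")
        # `images` goes out of scope each iteration, keeping memory bounded
```

The chunk boundaries are inclusive on both ends, matching how first_page and last_page behave in pdf2image.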
Briquette answered 6/6, 2019 at 6:23 Comment(4)
Very short, crisp and a brilliant solution. Thank you! – Labdanum
I got 'pdf2image' has no attribute '_page_count'. Any idea what this is about? – Swaggering
pdf2image._page_count is an undocumented function of the module; maybe it was removed or renamed. – Briquette
Try from pdf2image.pdf2image import pdfinfo_from_path, then pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)["Pages"]. – Tutelary

I am a bit late to this, but the problem is indeed related to the 136 pages going into memory. You can do three things.

  1. Specify a format for the converted images.

By default, pdf2image uses PPM as its image format. It is faster, but it also takes a lot more memory (over 30 MB per image!). What you can do to fix this is use a more memory-friendly format like JPEG or PNG.

convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')

That will probably solve the problem, but it's mostly just because of the compression, and at some point (say for a 500+ page PDF) the problem will reappear.
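A quick back-of-envelope calculation shows why uncompressed PPM adds up (a sketch assuming an A4 page and 3 bytes per RGB pixel; actual sizes depend on the page dimensions and dpi):

```python
def ppm_size_mb(dpi, width_in=8.27, height_in=11.69):
    # Raw PPM stores 3 bytes per RGB pixel, with no compression;
    # default page size here is A4 (8.27 x 11.69 inches)
    pixels = int(width_in * dpi) * int(height_in * dpi)
    return pixels * 3 / 1024 / 1024

print(ppm_size_mb(200))  # roughly 11 MB per page
print(ppm_size_mb(350))  # over 30 MB per page
```

At the dpi=200 used in the question, 136 pages of raw PPM already amount to well over a gigabyte.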

  2. Use an output directory

This is the one I would recommend because it allows you to process any PDF. The example on the README page explains it well:

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)

This writes the images to your computer's storage temporarily so you don't have to delete them manually. Make sure to do any processing you need before exiting the with context, though!

  3. Process the PDF file in chunks

pdf2image allows you to define the first and last page that you want to process. That means that in your case, with a PDF of 136 pages, you could do:

for i in range(0, 136 // 10 + 1):
    convert_from_path(r'C:\path\to\your\pdf', first_page=i * 10 + 1, last_page=min((i + 1) * 10, 136))
Newsstand answered 6/6, 2019 at 18:57 Comment(5)
Concerning processing the PDF in chunks: in the latest version of convert_from_path there's no first and last; instead it's first_page and last_page. – Feuar
@EugeneChabanov it's always been first_page and last_page, I just missed it when I first wrote the answer. I'll update it. – Newsstand
Does option #2 just write the images to storage instead of keeping them in memory? And if so, is the only advantage of using tempfile that it is garbage collected, or is there better performance for some reason? – Chatty
Also, it seems from the documentation that in convert_from_path the first_page and last_page are both inclusive, which I also confirmed through use, so it should be last_page = (i+1)*10 - 1. – Chatty
Actually, from using convert_from_path, it seems that the pages are not zero-indexed, so that the first page of the PDF is page 1, so it should be first_page=i*10 + 1 and last_page=(i+1)*10. – Chatty

The accepted answer has a small issue.

maxPages = pdf2image._page_count(pdf_file)

can no longer be used, as _page_count is deprecated. I found a working solution:

import pdf2image
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open(pdf, "rb"))
maxPages = inputpdf.numPages
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages),
                                             fmt='jpg', thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)

This way, however large the file is, it processes 100 pages at a time and RAM usage stays minimal.
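Note that pil_images is overwritten on every iteration, so the converted pages are lost unless each chunk is persisted before the next one is created. A minimal sketch of that step (the save_chunk helper and the file-naming scheme are illustrative, not part of the answer above):

```python
import os

def save_chunk(pil_images, out_dir, first_page):
    # Write one chunk of PIL images to disk so the chunk can be
    # garbage-collected before the next one is converted
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, image in enumerate(pil_images, start=first_page):
        path = os.path.join(out_dir, "page_%04d.jpg" % n)
        image.save(path, "JPEG")
        paths.append(path)
    return paths
```

Called as save_chunk(pil_images, 'pages', page) inside the loop above, it keeps only one chunk's images in memory at a time.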

Kokura answered 16/9, 2019 at 23:19 Comment(0)

A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder). I guess https://github.com/Belval/pdf2image will help you understand.

Solution: Break the PDF into small parts and convert each into an image. The images can then be merged...

from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("document.pdf", "rb"))

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)

Refer: split a multi-page pdf file into multiple pdf files with python?

import numpy as np
import PIL

list_im = ['Test1.jpg', 'Test2.jpg', 'Test3.jpg']
imgs = [PIL.Image.open(i) for i in list_im]
# Pick the image which is the smallest, and resize the others to match it
# (can be an arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])

# Save that beautiful picture
imgs_comb = PIL.Image.fromarray(imgs_comb)
imgs_comb.save('Trifecta.jpg')

# For vertical stacking it is simple: use vstack
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
imgs_comb = PIL.Image.fromarray(imgs_comb)
imgs_comb.save('Trifecta_vertical.jpg')

Refer: Combine several images horizontally with Python

Heighttopaper answered 6/6, 2019 at 6:26 Comment(0)

Eventually, combining these techniques, I ended up with the following code, with the goal of converting a PDF into a PPTX while avoiding memory overflow and keeping good speed in mind:

import sys
from io import BytesIO
from pdf2image import pdfinfo_from_path, convert_from_path
from pptx import Presentation

pdf_file = sys.argv[1]
print("Converting file: " + pdf_file)

# Prep presentation
prs = Presentation()
blank_slide_layout = prs.slide_layouts[6]

# Output file base name
base_name = pdf_file.split(".pdf")[0]

# Convert PDF to list of images
print("Starting conversion...")
print()
path: str = "C:/ppttemp"  #temp dir (use cron to delete files older than 1h hourly)
slideimgs = []
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path='C:/Program Files/poppler-0.90.1/bin/')
maxPages = info["Pages"]
for page in range(1, maxPages + 1, 5):
    slideimgs.extend(convert_from_path(pdf_file, dpi=250, output_folder=path, first_page=page, last_page=min(page + 5 - 1, maxPages), fmt='jpeg', thread_count=4, poppler_path='C:/Program Files/poppler-0.90.1/bin/', use_pdftocairo=True))

print("...complete.")
print()

# Loop over slides
for i, slideimg in enumerate(slideimgs):
    if i % 5 == 0:
        print("Saving slide: " + str(i))

    imagefile = BytesIO()
    slideimg.save(imagefile, format='jpeg')
    imagefile.seek(0)
    width, height = slideimg.size
    width, height = slideimg.size

    # Set slide dimensions
    prs.slide_height = height * 9525
    prs.slide_width = width * 9525

    # Add slide
    slide = prs.slides.add_slide(blank_slide_layout)
    pic = slide.shapes.add_picture(imagefile, 0, 0, width=width * 9525, height=height * 9525)
    

# Save Powerpoint
print("Saving file: " + base_name + ".pptx")
prs.save(base_name + '.pptx')
print("Conversion complete. :)")
print()
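The 9525 factor in the slide-sizing lines above converts pixels to EMU (English Metric Units), the unit python-pptx uses for all lengths, under the assumption that the images are 96 dpi:

```python
# python-pptx measures lengths in EMU: 914400 EMU per inch.
# At an assumed 96 pixels per inch, one pixel is 914400 / 96 EMU.
EMU_PER_INCH = 914400
ASSUMED_DPI = 96
EMU_PER_PIXEL = EMU_PER_INCH // ASSUMED_DPI
print(EMU_PER_PIXEL)  # 9525
```

Since the script renders the PDF at dpi=250, the slides come out physically larger than the original pages, but the aspect ratio is preserved.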
Saponin answered 5/11, 2020 at 21:21 Comment(0)

This code converts a PDF in chunks and then adds the images to an array:

from pdf2image import pdfinfo_from_path, convert_from_path

PDF = "/path/to/pdf.pdf"
CHUNK_SIZE = 20 # depends on your RAM
MAX_PAGES = pdfinfo_from_path(PDF)["Pages"]

images = []
for page in range(1, MAX_PAGES + 1, CHUNK_SIZE):
    images += convert_from_path(PDF, first_page=page, last_page=min(page + CHUNK_SIZE - 1, MAX_PAGES))
Eden answered 15/4, 2023 at 14:40 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.