How to "write to variable" instead of "to file" in Python
A

3

2

I'm trying to write a function that splits a PDF into separate pages. I copied this simple function from this SO answer:

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
    return pages

This, however, writes the new PDFs to disk instead of returning a list of the new PDFs as file variables. So I replaced the line output.write(outputStream) with:

pages.append(outputStream)

When trying to write the elements in the pages list however, I get a ValueError: I/O operation on closed file.

Does anybody know how I can add the new files to the list and return them, instead of writing them to file? All tips are welcome!

Ambrogino answered 23/10, 2014 at 13:31 Comment(6)
Have you tried reading the data, rather than storing the file handle - pages.append(outputStream.read())?Nuno
Have you tried using cStringIO.StringIO to open outputStream?Jaquesdalcroze
what the user above said... you can usually substitute a StringIO object for a file and get the result out as a string that wayGalactic
@Nuno - I just tried it, and that gives me an IOError: File not open for reading on the line pages.append(outputStream.read()). Any other ideas?Ambrogino
@Jaquesdalcroze - Ehm, no I haven't tried StringIO. Any tips on how to do that? A code example would be very welcome.. :)Ambrogino
What is the use case? Do you want a list of file handles to operate on after calling splitPdf? Can't you just use a list of paths instead?Ronnaronnholm
J
6

It is not completely clear what you mean by "list of PDFs as file variables". If you want strings holding the PDF contents instead of files, and a list of such strings as the return value, replace open() with StringIO and call getvalue() to obtain the contents:

import cStringIO

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        io = cStringIO.StringIO()
        output.write(io)
        pages.append(io.getvalue())
    return pages
Jaquesdalcroze answered 23/10, 2014 at 14:36 Comment(2)
(This answer is Python 2 only)Tabulator
@Tabulator It should be quite straightforward to adapt to Python 3, though.Jaquesdalcroze
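As a rough sketch of that Python 3 adaptation: cStringIO no longer exists there, and PDF data is binary, so io.BytesIO takes its place and getvalue() returns bytes. FakeWriter below is a hypothetical stand-in for PdfFileWriter, included only so the snippet runs without a PDF library; it assumes, as the answer above does, that the real writer's write() accepts any binary file-like object.

```python
import io


class FakeWriter:
    """Hypothetical stand-in for PdfFileWriter: writes fixed bytes to a stream."""

    def __init__(self, payload):
        self.payload = payload

    def write(self, stream):
        stream.write(self.payload)


def collect_bytes(writers):
    """Return the output of each writer as a bytes object."""
    pages = []
    for writer in writers:
        buffer = io.BytesIO()            # in-memory replacement for open(..., "wb")
        writer.write(buffer)
        pages.append(buffer.getvalue())  # whole contents, independent of stream position
    return pages


pages = collect_bytes([FakeWriter(b"page 0"), FakeWriter(b"page 1")])
```

In splitPdf itself, the same change would amount to swapping cStringIO.StringIO() for io.BytesIO() and leaving the rest of the loop as it is.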
I
7

You can use the in-memory binary streams in the io module. These keep the PDF data in memory instead of writing it to disk.

import io

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        outputStream = io.BytesIO()

        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        output.write(outputStream)

        # Move the stream position to the beginning,
        # making it easier for other code to read
        outputStream.seek(0)

        pages.append(outputStream)
    return pages

To later write the objects to a file, use shutil.copyfileobj:

import shutil

with open('page0.pdf', 'wb') as out:
    shutil.copyfileobj(pages[0], out)
Inquiline answered 23/10, 2014 at 14:7 Comment(0)
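A small illustration of why the seek(0) call matters, using only the standard library. The bytes written here are placeholder data rather than a real PDF, and the variable names are made up for the example.

```python
import io
import shutil

stream = io.BytesIO()
stream.write(b"%PDF-1.4 placeholder bytes")  # placeholder data, not a valid PDF

# After writing, the position is at the end, so read() finds nothing left.
read_without_seek = stream.read()

# Rewind, then copy; copyfileobj starts reading at the current position.
stream.seek(0)
target = io.BytesIO()  # stands in for a file opened with "wb"
shutil.copyfileobj(stream, target)
```

Here read_without_seek is empty, while target ends up with the full contents. Note that copyfileobj leaves the source stream at its end again, so a second copy from the same object needs another seek(0).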
D
1

I haven't used PdfFileWriter, but I think this should work: keep the PdfFileWriter objects in the list and write them to files later.

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        pages.append(output)
    return pages

def writePdf(pages):
    # Number the output files starting at 1
    for i, p in enumerate(pages, start=1):
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            p.write(outputStream)
Dylandylana answered 23/10, 2014 at 14:24 Comment(0)
