How to retrieve ALL pages from PDF as a single string in Python 3 using PyPDF2
G

4

5

In order to get a single string from a multi-paged PDF I'm doing this:

import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    output = page.extractText()
output

The result is a string from a single page (the last page in the document), just as it should be according to the PyPDF2 documentation. I used this approach because I've seen people suggest it for reading a whole PDF, but it does not work in my case.

Obviously, this is a basic operation, and I apologize in advance for my lack of experience. I tried other solutions like Tika, PDFMiner and Textract, but PyPDF2 seems to be the only one that has worked for me so far.

Any help would be appreciated.

Update:

As suggested, I defined an output as a list and then appended to it (as I thought) all pages in a loop like this:

for i in range(count):
    page = pdfReader.getPage(i)
    output = []
    output.append(page.extractText())

The result, though, is a list with a single string, like ['sample content from the last page of PDF']

Grimes answered 13/2, 2020 at 1:3 Comment(12)
Aren't you overwriting output every time? - Frisco
@Frisco I guess... But it's impossible to concat str to bytes. - Grimes
I'm not sure I understand how that relates to my question, sorry. - Frisco
@Frisco If I use output += page.extractText() to avoid overwriting, as suggested below, I get TypeError: can't concat str to bytes. - Grimes
How do you define output? In any case, what I had in mind was using something like a list. - Frisco
@Frisco As a string. Sorry, I don't quite understand. Do you mean getting the output as a list of strings retrieved from each page? How do I get such a list if getPage takes a single page number as an argument? - Grimes
"As a string." Then that explains the error, right? "Sorry, I don't quite understand. You mean to get an output as a list of strings retrieved from each page? How to get such a list if getPage takes a single page number as an argument?" All I meant is that you could define output as a list and then append the result of page.extractText() where you're currently assigning it to output. - Frisco
@Frisco Thank you, but it creates a list with a single string like ['sample content from the last page of PDF']. How can I loop over the whole range of pages? I posted that piece of code in the question update. - Grimes
Look at where you defined the list; it's a similar issue to the first one. - Frisco
Do you want me to post an answer? - Frisco
@Frisco Sure! Certainly I'm not the only beginner who does not know how to loop properly =) - Grimes
Done! Let me know if you want me to expand on any area. - Frisco
L
6

Could it be because of this line:

output = page.extractText()

Try this instead:

output += page.extractText()

In your code, you're overwriting the value of the output variable on every iteration instead of appending to it. Don't forget to declare output before the for loop, i.e. put output = '' before for i in range(count):
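A minimal sketch of that fix, which also reproduces the TypeError mentioned in the comments. FakePage is a hypothetical stand-in for a PyPDF2 page object (it is not part of PyPDF2), used only so the snippet runs without a PDF file; the page texts are made up.

```python
class FakePage:
    """Hypothetical stand-in for a PyPDF2 page; not part of the library."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        # Mirrors the method name used in this thread; returns a str.
        return self._text


pages = [FakePage("first page "), FakePage("second page "), FakePage("last page")]

output = ""  # a str, created once, before the loop
for page in pages:
    output += page.extractText()  # append instead of overwrite

print(output)

# Starting from a bytes object instead of a str triggers the TypeError
# reported in the comments, because str cannot be added to bytes.
error_message = None
try:
    wrong = b""
    wrong += pages[0].extractText()
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the error appears in real code, it is worth checking whether output was initialised as b'' or read from a binary file rather than as ''.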

Litre answered 13/2, 2020 at 1:14 Comment(2)
Thank you! Apparently, yes. This is the error I get: TypeError: can't concat str to bytes. As I understand, this is because I pass 'rb' as an argument to open. But otherwise I get: PdfFileReader stream/file object is not in binary mode. Is there an option to convert bytes to string some other way? - Grimes
What are you trying to do? To write the output to a text file: with open('sample.txt', 'w') as f: f.writelines(output). Don't forget to declare the "output" variable before the for loop. So output = '' before for i in range(count): - Litre
T
4

This code works:

import os, glob, PyPDF2

file_path = 'C:/Users/ipeter/Desktop/Webdriverdownloads'
# Collect the path of every PDF in the folder
read_files = glob.glob(os.path.join(file_path, '*.pdf'))

for pdf_path in read_files:
    pdfReader = PyPDF2.PdfFileReader(pdf_path)
    count = pdfReader.numPages
    output = []  # one list per file, created before the page loop
    for i in range(count):
        page = pdfReader.getPage(i)
        output.append(page.extractText())
    print(output)

The first loop reads all files in a folder. The second loop reads all pages in the pdf.

output[0] = pdfpage1
output[1] = pdfpage2
output[2] = pdfpage3

... etc

If you need the entire PDF in one string, you can build newoutput with the join function:

separator = ','
newoutput = separator.join(output)

or simplify:

newoutput = ','.join(output)
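A small illustration of how the choice of separator affects the joined string. The page_texts list is made-up sample data standing in for the output list above; note that ',' puts a comma between pages, which may or may not be wanted.

```python
# Made-up page texts standing in for the `output` list built above.
page_texts = ["text of page 1", "text of page 2", "text of page 3"]

with_commas = ",".join(page_texts)     # a comma marks each page boundary
with_newlines = "\n".join(page_texts)  # one page per line
no_separator = "".join(page_texts)     # pages run straight into each other

print(with_commas)
print(with_newlines)
print(no_separator)
```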
Trifid answered 14/2, 2020 at 2:57 Comment(1)
@Grimes Hope this helps. - Trifid
F
3

You're overwriting the output variable each time.

While you could concatenate the strings together using output +=, it's probably safer to use a list instead, in which case you would have output = [] defined outside the loop, and replace output = page.extractText() with output.append(page.extractText()).
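To make the difference concrete, here is a rough side-by-side sketch of the list being created inside versus outside the loop. FakePage and both function names are hypothetical and exist only for this illustration; FakePage imitates a page object so the snippet runs without PyPDF2.

```python
class FakePage:
    """Hypothetical stand-in for a PyPDF2 page; not part of the library."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


def last_page_only(pages):
    # The pattern from the question update: the list is recreated on every
    # iteration, so only the final page survives.
    for page in pages:
        output = []
        output.append(page.extractText())
    return output


def all_pages(pages):
    # The list is created once, before the loop, so every page is kept.
    output = []
    for page in pages:
        output.append(page.extractText())
    return output


pages = [FakePage("page 1"), FakePage("page 2"), FakePage("page 3")]
print(last_page_only(pages))
print(all_pages(pages))
```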

Frisco answered 13/2, 2020 at 20:44 Comment(0)
R
1

Try creating output as an empty string first:

output = ""
for i in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    output += pageObj.extractText()
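As far as I know, newer releases (PyPDF2 3.x and its successor, pypdf) replaced the camelCase names used in this thread with PdfReader, reader.pages and page.extract_text(). The sketch below assumes that interface. extract_all_text, StubPage and StubReader are hypothetical names introduced here; the stubs only imitate the reader so the snippet runs without a PDF or the library installed.

```python
def extract_all_text(reader, separator="\n"):
    """Join the text of every page of a reader-like object into one string.

    `reader` only needs a `.pages` iterable whose items provide
    `extract_text()`. Pages that yield None or an empty value contribute
    an empty string.
    """
    return separator.join(page.extract_text() or "" for page in reader.pages)


class StubPage:
    """Hypothetical stand-in for a page object."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class StubReader:
    """Hypothetical stand-in for a PDF reader exposing `.pages`."""

    def __init__(self, texts):
        self.pages = [StubPage(text) for text in texts]


whole_document = extract_all_text(StubReader(["page one", "page two", "page three"]))
print(whole_document)

# With the real library this would look roughly like the following,
# assuming pypdf is installed and sample.pdf exists:
#   from pypdf import PdfReader
#   whole_document = extract_all_text(PdfReader("sample.pdf"))
```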
Ratty answered 10/9, 2021 at 8:30 Comment(0)
