To get a single string from a multi-page PDF, I'm doing this:
import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    output = page.extractText()
output
The result is a string from a single page (the last page in the document) - just as it should be according to the PyPDF2 documentation. I used this approach because I've seen people suggest it for reading a whole PDF, but that does not work in my case.
Obviously, this is a basic operation, and I apologize in advance for my lack of experience. I tried other solutions like Tika, PDFMiner and Textract, but PyPDF2 seems to be the only one that works for me so far.
Any help would be appreciated.
Update:
As suggested, I defined output as a list and then appended all pages to it (or so I thought) in a loop like this:
for i in range(count):
    page = pdfReader.getPage(i)
    output = []
    output.append(page.extractText())
The result, though, is a single string in the list, like ['sample content from the last page of PDF']
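For reference, here is a minimal sketch of the accumulation pattern, assuming the legacy PyPDF2 reader interface used in the question (numPages, getPage, extractText). The helper name collect_text is hypothetical and not part of PyPDF2. The list is created once, before the loop, so each iteration adds to it instead of replacing it, and the pieces are joined into one string at the end.

```python
def collect_text(reader):
    """Return the text of every page in `reader` as one string.

    `reader` is assumed to expose the legacy PyPDF2.PdfFileReader
    interface from the question: numPages, getPage(i), and pages
    with an extractText() method.
    """
    parts = []  # created once, before the loop, so earlier pages are kept
    for i in range(reader.numPages):
        parts.append(reader.getPage(i).extractText())
    # Join the per-page strings into a single string, one page per line break.
    return "\n".join(parts)


# Hypothetical usage with the file from the question (not executed here):
# import PyPDF2
# with open('sample.pdf', 'rb') as pdfFileObject:
#     text = collect_text(PyPDF2.PdfFileReader(pdfFileObject))
```

Joining with a newline is a choice made for this sketch; an empty string or a page marker would work just as well as the separator.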
[...] output every time? – Frisco
[...] concat str to bytes – Grimes
[...] output += page.extractText() to avoid overwriting, as suggested below, I get TypeError: can't concat str to bytes – Grimes
[...] output? In any case, what I had in mind was using something like a list. – Frisco
[...] getPage takes a single page number as an argument? – Grimes
[...] output as a list and then append the result of page.extractText() where you're currently assigning it to output. – Frisco
[...] ['sample content from the last page of PDF']. How can I loop over the whole range of pages? I posted that piece of code in the question update. – Grimes