EOF marker not found while using PyPDF2 to merge PDF files in Python

9

16

When I use the following code

from PyPDF2 import PdfFileMerger

merge = PdfFileMerger()

for newFile in nlst:
    merge.append(newFile)
merge.write("newFile.pdf")

I get the following error:

raise utils.PdfReadError("EOF marker not found")

PyPDF2.utils.PdfReadError: EOF marker not found

Could anybody tell me what happened?

Taynatayra answered 29/7, 2017 at 14:50 Comment(1)
This error can occur when reading a file that is not a PDF. Be careful with the "for ... in" loop, and print the errors to see what is happening. Diffractive
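One way to act on this comment is to screen every input before merging. The sketch below uses only the standard library; the function name find_suspect_pdfs and the 1024-byte windows are assumptions chosen for illustration and are not part of PyPDF2.

```python
from pathlib import Path


def find_suspect_pdfs(paths, tail_size=1024):
    """Return (path, reason) pairs for files that do not look like complete PDFs."""
    suspects = []
    for path in paths:
        data = Path(path).read_bytes()
        if b"%PDF-" not in data[:1024]:
            # Not a PDF at all (for example an HTML error page saved as .pdf)
            suspects.append((str(path), "no %PDF- header"))
        elif b"%%EOF" not in data[-tail_size:]:
            # Truncated file, or extra content appended after the marker
            suspects.append((str(path), "no %%EOF marker near the end"))
    return suspects
```

Calling find_suspect_pdfs(nlst) before the merge loop and printing the result should show which file triggers the error.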
10

After encountering this problem using camelot and PyPDF2, I did some digging and have solved the problem.

The end-of-file marker '%%EOF' is meant to be the very last line, but some PDF files have a huge chunk of JavaScript appended after this line, so the reader cannot find the EOF marker.

This is what the EOF marker followed by the JavaScript looks like when you read the file as lines of bytes:

 b'>>\r\n',
 b'startxref\r\n',
 b'275824\r\n',
 b'%%EOF\r\n',
 b'\n',
 b'\n',
 b'<script type="text/javascript">\n',
 b'\twindow.parent.focus();\n',
 b'</script><!DOCTYPE html>\n',
 b'\n',
 b'\n',
 b'\n',

So you just need to truncate the file at the point where the JavaScript begins.

Solution:

import PyPDF2

def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # default: keep every line if no EOF marker is found
    actual_line = len(pdf_stream_in)
    # find the line position of the last EOF marker, searching from the end
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break

    # return the list up to and including that line
    return pdf_stream_in[:actual_line]

# open the file for reading in binary mode
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
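The same truncation can also be done on the raw bytes instead of a list of lines. This is only a sketch: the helper name truncate_after_last_eof is made up for illustration, and it assumes the appended junk does not itself contain the text %%EOF.

```python
def truncate_after_last_eof(data: bytes) -> bytes:
    """Drop everything that follows the last %%EOF marker."""
    marker = b"%%EOF"
    pos = data.rfind(marker)  # the last marker also covers incrementally updated PDFs
    if pos == -1:
        raise ValueError("no %%EOF marker found; the file may be truncated")
    # keep the marker itself and end the file with a single newline
    return data[:pos + len(marker)] + b"\n"
```

To use it, read the file with open(..., 'rb'), pass the bytes through the function, and write the result to a new file with open(..., 'wb').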
Dragoman answered 5/2, 2021 at 6:11 Comment(0)
8

PDF is a file format in which a parser normally starts by reading some global information located at the end of the file. The very end of the document must contain a line with the content

%%EOF

This marker tells the PDF parser that the document ends here and that the global information it needs (a startxref section) should come just before it.

I guess the error message you see means that one of the input documents was truncated and is missing this %%EOF marker.
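To make the layout concrete, here is a rough sketch of how a reader might locate that trailing information. The function name read_startxref and the 1024-byte window are assumptions for illustration; real parsers such as PyPDF2 do this differently and more carefully.

```python
import re


def read_startxref(data: bytes, tail_size: int = 1024):
    """Return the offset that follows the last 'startxref' keyword, or None."""
    tail = data[-tail_size:]  # only the end of the file is inspected
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        return None  # truncated file, or the marker is not near the end
    return int(matches[-1].decode("ascii"))
```

A return value of None for one of the input files would be consistent with the truncation guess above.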

Eris answered 31/7, 2017 at 11:53 Comment(0)
3

One simple solution for this problem (EOF marker not found): open your .pdf file in another application (I used LibreOffice Draw on Ubuntu 18.04), then export the file as .pdf again. With the exported .pdf file the problem no longer occurs.

Titleholder answered 27/3, 2020 at 2:7 Comment(0)
3

PyPDF2 cannot find the EOF marker in a PDF that is encrypted.

I came across the same error while working through the (excellent) Automate the Boring Stuff, 2nd edition, Chapter 15, page 355, project "Combining Select Pages from Many PDFs".

I chose to combine all the PDFs I had made during this chapter into one document. One of them was an encrypted PDF, and the project failed when it reached the end of the encrypted document, with the error message:

PyPDF2.utils.PdfReadError: EOF marker not found

I moved the encrypted file to a different folder (so it would not be merged with the other PDFs) and the project worked fine.

So, it seems PyPDF2 cannot find the EOF marker in a PDF that is encrypted.

Artieartifact answered 24/12, 2022 at 11:36 Comment(0)
0

I also had this problem and found a solution.

First, Python reads and writes PDF files in binary mode ('rb' or 'wb').

END OF FILE

This occurs when there is an open parenthesis somewhere on a line but no matching closing parenthesis. Python reaches the end of the file while looking for the closing parenthesis.

Here is one solution:

  1. Close the file that you opened earlier using this command

    newfile.close()

  2. Check whether the same PDF is also open under another variable, and close that too

    Same_file_with_another_variable.close()

Now open it only once and use it, and you are good to go.

Vortical answered 30/6, 2019 at 12:59 Comment(1)
You should mention what the 'newfile' object is. Relate it to the original question. Sybil
0

I wanted to add my hacky solution to this issue.

I had the same error with Python requests (application/pdf). In my case the provider (a shipping labeling service) returned a 200 and a bytes string (b'...') representing the PDF, but in some random cases the EOF marker was missing.

Because it was random, I came up with the following solution:

for obj in label_objects:
    get_label = api.get_label(label_id=obj.label_id)
    # request the label again until the response contains the EOF marker
    while '%%EOF' not in str(get_label.content):
        get_label = api.get_label(label_id=obj.label_id)

After a few tries it returns the bytes string with the EOF marker and we're good to proceed.
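If the provider never returns a complete file, the loop above runs forever. A bounded variant could look like the sketch below; fetch_complete_pdf, max_attempts and delay_seconds are hypothetical names, and fetch stands for any zero-argument callable that returns the response bytes.

```python
import time


def fetch_complete_pdf(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until the bytes contain %%EOF near the end, or give up."""
    for attempt in range(1, max_attempts + 1):
        content = fetch()
        if b"%%EOF" in content[-1024:]:
            return content
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # optional pause before the next request
    raise RuntimeError(f"no complete PDF after {max_attempts} attempts")
```

With the API from this answer, the call would be something like fetch_complete_pdf(lambda: api.get_label(label_id=obj.label_id).content).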

Stickybeak answered 12/6, 2022 at 17:38 Comment(0)
0

Please use this code:

import io

import pypdf
import requests

# download the PDF and wrap the bytes in an in-memory file object
response = requests.get("your link")
pdf_io_bytes = io.BytesIO(response.content)
text_list = []
pdf = pypdf.PdfReader(pdf_io_bytes)

num_pages = len(pdf.pages)

# extract the text of every page
for page in range(num_pages):
    page_text = pdf.pages[page].extract_text()
    text_list.append(page_text)
text = "\n".join(text_list)
Lait answered 12/10, 2023 at 19:45 Comment(0)
0

Adding one more solution that worked around this issue for me - if the input files you are seeking to merge into a PDF are images, you could use a PIL function to merge as PDF via the append parameter while saving:

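from PIL import Image

# jpg_pths is your list of image file paths
images = [
    # the pdf page order will follow this list
    # 'RGB' is used because older Pillow versions cannot save 'RGBA' images as PDF
    Image.open(pth).convert('RGB')
    for pth in jpg_pths
]

pth_pdf_ou = "my_merged.pdf"

# save the first image and append the rest as further pages
images[0].save(
    pth_pdf_ou,
    "PDF",
    resolution=100.0,
    save_all=True,
    append_images=images[1:]
)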
Wodge answered 16/8 at 5:52 Comment(0)
-1

I had the same problem. For me the solution was to close the previously opened file before working with it again.
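A likely reason this helps: data written through a handle that is still open may sit in a buffer, so a reader opening the same path can see a file without its trailing %%EOF. A minimal sketch, assuming you write the PDF bytes yourself (write_then_read is a made-up helper name):

```python
def write_then_read(path, payload: bytes) -> bytes:
    """Write payload to path, then read it back once the writer is closed."""
    # Leaving the with block closes the file and flushes buffered bytes to disk
    with open(path, "wb") as out:
        out.write(payload)
    # Only now is it safe to open the same path for reading
    with open(path, "rb") as src:
        return src.read()
```

Using with blocks like this avoids having to remember the explicit close() call.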

Ambert answered 27/12, 2022 at 20:45 Comment(0)
