After encountering this problem using camelot
and PyPDF2
, I did some digging and have solved the problem.
The end of file marker '%%EOF'
is meant to be the very last line, but some PDF files put a huge chunk of javascript after this line, and the reader cannot find the EOF.
Illustration of what the EOF plus javascript looks like if you open it:
b'>>\r\n',
b'startxref\r\n',
b'275824\r\n',
b'%%EOF\r\n',
b'\n',
b'\n',
b'<script type="text/javascript">\n',
b'\twindow.parent.focus();\n',
b'</script><!DOCTYPE html>\n',
b'\n',
b'\n',
b'\n',
So you just need to truncate the file before the javascript begins.
Solution:
def reset_eof_of_pdf_return_stream(pdf_stream_in:list):
# find the line position of the EOF
for i, x in enumerate(txt[::-1]):
if b'%%EOF' in x:
actual_line = len(pdf_stream_in)-i
print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
break
# return the list up to that point
return pdf_stream_in[:actual_line]
# opens the file for reading
with open('data/XXX.pdf', 'rb') as p:
txt = (p.readlines())
# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)
# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
f.writelines(txtx)
fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')