Debugging PDF for error

I'm creating PDF files using the PDF Clown Java library.

Sometimes, when opening these files with Adobe Acrobat Reader, I get the famous error message:

"An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem."

With the attached file, the error shows up in Adobe Reader only when scrolling down to the 8th page and then scrolling back up to the 3rd page. Alternatively, zooming out to 33.3% also produces the message.

For the record, Foxit Reader opens the file flawlessly, as do other PDF viewers such as browsers.

My questions are:

  1. What's wrong with my file? (The file is attached.)

  2. How can I find out what's wrong with it? Is there a tool that tells you where the error lies?

Thanks!

Szombathely answered 15/9, 2013 at 13:13 Comment(4)
Adobe Acrobat has some profiling profiles that can help there. - Fino
I tried checking it with Preflight, and for each check it gave me "An error occurred while parsing a content stream. Unable to analyze the PDF file.". Please help... - Szombathely
Adobe Acrobat 9.5 Preflight fails on this document... ;) - Zacatecas
Same problem here, and Preflight fails in my case too... :( So I guess there is no tool that really tells you where the error is... Well done, Adobe. Useless as always... - Accolade
Answer (score 5)

OK, this wasn't easy.

Due to a bug in PDF Clown, the main content stream of the PDF page was corrupted: after its end, it contained a copy of a past instance of itself. This produced a partial text section without the starting "BT" operator, which left a single "ET" without a matching "BT" at the end of the stream.

Once I corrected this, it ran great.

Thank you all for your help. I would have had a much more difficult time debugging it without the tool RUPS, which @Bruno suggested.

edit:

The bug was in Buffer.java, in clone() (line 217).

Instead of the line:

clone.append(data);

it needs to be:

clone.append(data, 0, this.length);

Without this correction, clone() copies the whole backing data array and sets the cloned Buffer's length to data[].length. This is very problematic if Buffer.length is smaller than data[].length. The result in my case was garbage at the end of the stream.
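
To illustrate the effect, here is a minimal, self-contained sketch. It is not PDF Clown's actual code: SimpleBuffer and its methods are hypothetical stand-ins that only assume a growable buffer with a backing array larger than its logical length. It compares the two append calls quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a growable byte buffer; NOT PDF Clown's real class. */
final class SimpleBuffer {
    private byte[] data; // backing array, usually larger than the content
    private int length;  // number of valid bytes in data

    SimpleBuffer(int capacity) {
        data = new byte[capacity];
    }

    /** Appends a whole array. */
    void append(byte[] src) {
        append(src, 0, src.length);
    }

    /** Appends count bytes of src starting at offset, growing the array if needed. */
    void append(byte[] src, int offset, int count) {
        if (length + count > data.length) {
            data = Arrays.copyOf(data, Math.max(data.length * 2, length + count));
        }
        System.arraycopy(src, offset, data, length, count);
        length += count;
    }

    int length() {
        return length;
    }

    /** Returns a copy of the valid bytes only. */
    byte[] toByteArray() {
        return Arrays.copyOf(data, length);
    }

    /** Buggy variant: copies the entire backing array, unused slots included. */
    SimpleBuffer buggyClone() {
        SimpleBuffer clone = new SimpleBuffer(data.length);
        clone.append(data);
        return clone;
    }

    /** Fixed variant: copies only the first 'length' bytes. */
    SimpleBuffer fixedClone() {
        SimpleBuffer clone = new SimpleBuffer(data.length);
        clone.append(data, 0, length);
        return clone;
    }

    public static void main(String[] args) {
        SimpleBuffer original = new SimpleBuffer(16);
        original.append("BT ET".getBytes(StandardCharsets.US_ASCII)); // 5 valid bytes
        System.out.println("original:    " + original.length());
        System.out.println("buggy clone: " + original.buggyClone().length());
        System.out.println("fixed clone: " + original.fixedClone().length());
    }
}
```

In this sketch the buggy clone reports a length of 16 instead of 5, and its extra 11 bytes are whatever the backing array happened to hold: zeros for a fresh array, or stale content if the array had been reused.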

Szombathely answered 20/9, 2013 at 19:53 Comment(3)
(Sorry for my late comment; I'm PDF Clown's author.) It would be helpful if you indicated the actual code that caused your issue, so that a constraint can possibly be imposed to avoid it. Thanks. - Sulphathiazole
@StefanoChizzolini, I sent you an email with the solution at the time. Anyway, I edited the answer so that it includes the fix. - Szombathely
You are absolutely right, it was my fault! I have just retrieved your mail dated Fri, September 20, 2013 10:14 pm: during that period I was taking a hiatus from the project, so I overlooked it. I'm really sorry. Nonetheless, it's always a good thing to post your solution in the first place, as it may benefit any other user. I'm going to include it in the next release of PDF Clown (0.2.0). Thank you very much! - Sulphathiazole
Answer (score 4)

The error shows while reading (with Adobe) the attached file only when scrolling down to the 8th page, then scrolling back up to the 3rd page. Alternatively, zooming out to 33.3% will also produce the message.

Well, I can trigger it more easily: I merely open the PDF and scroll down using the cursor keys. As soon as the top 2 cm of page 3 appear, the message appears.

What's wrong with my file??

The content of pages 1 and 2 looks OK, so let's look at the content of page 3.

My initial attribution of the issue to the use of text-specific operations (especially Tf and Tw) outside of a text object was wrong, as Stefano Chizzolini pointed out: some text-related operations are indeed allowed outside text objects, namely the text state operations; cf. figure 9 from the PDF specification:

[Figure 9: Graphics Objects]

So while being less common, text state operations at page description level are completely ok.
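
To make the distinction concrete, here is a small sketch that groups the text operators of ISO 32000-1 by whether they need an enclosing BT ... ET text object. The class and method names are made up for illustration; only the operator names come from the specification.

```java
import java.util.Set;

/** Illustrative grouping of PDF text operators; the names of this class are hypothetical. */
final class TextOperators {
    // Text state operators: allowed inside text objects AND at page description level.
    static final Set<String> TEXT_STATE =
            Set.of("Tc", "Tw", "Tz", "TL", "Tf", "Tr", "Ts");

    // Text-positioning operators: only allowed between BT and ET.
    static final Set<String> TEXT_POSITIONING =
            Set.of("Td", "TD", "Tm", "T*");

    // Text-showing operators: only allowed between BT and ET.
    static final Set<String> TEXT_SHOWING =
            Set.of("Tj", "TJ", "'", "\"");

    /** True if the operator is a text state operator such as Tf or Tw. */
    static boolean isTextState(String op) {
        return TEXT_STATE.contains(op);
    }

    /** True if the operator may only appear inside a text object. */
    static boolean requiresTextObject(String op) {
        return TEXT_POSITIONING.contains(op) || TEXT_SHOWING.contains(op);
    }

    public static void main(String[] args) {
        for (String op : new String[] {"Tf", "Tw", "Tj", "Td"}) {
            System.out.println(op + " requires BT/ET: " + requiresTextObject(op));
        }
    }
}
```

So a Tf or Tw before the first BT is fine, whereas a Tj or Td there would be a genuine syntax error.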

After my incorrect attempt to explain the issue, the OP's own answer indicated that the

main stream of information in the PDF page has been corrupted. After its end it had a copy of a past instance of it. This caused a partial text section without the starting command "BT" - which left a single "ET" without a "BT" in the end of the stream.

An ET without a prior BT would indeed be an error, and quite likely it would be accompanied by operations at the wrong level. Inspecting the stream content of that third page (the focus of this issue), though, I could not find any unmatched ET. In the course of that inspection, however, I discovered that the content stream contains more than 2000 trailing 0 bytes! Adobe Reader seems unable to cope with these 0 bytes.
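
Both checks can be roughly automated once the decoded content stream has been extracted (e.g. copied out of RUPS). The following is a minimal sketch with hypothetical names. It assumes the stream is already decompressed, and its tokenizer is deliberately naive: it splits on whitespace only, so a "BT" or "ET" inside a string literal, comment, or inline image would be miscounted.

```java
import java.nio.charset.StandardCharsets;

/** Rough sanity checks for a decoded page content stream; names are hypothetical. */
final class ContentStreamCheck {

    /** Counts the 0x00 bytes at the very end of the stream. */
    static int countTrailingZeroBytes(byte[] stream) {
        int end = stream.length;
        while (end > 0 && stream[end - 1] == 0) {
            end--;
        }
        return stream.length - end;
    }

    /**
     * Returns true if every BT is closed by an ET, no ET appears without
     * a preceding BT, and text objects are not nested.
     * Naive assumption: operators are separated by whitespace (NUL included).
     */
    static boolean hasBalancedTextObjects(byte[] stream) {
        String text = new String(stream, StandardCharsets.ISO_8859_1);
        String[] tokens = text.trim().split("[\\x00\\s]+");
        boolean insideTextObject = false;
        for (String token : tokens) {
            if (token.equals("BT")) {
                if (insideTextObject) {
                    return false; // nested BT
                }
                insideTextObject = true;
            } else if (token.equals("ET")) {
                if (!insideTextObject) {
                    return false; // ET without BT
                }
                insideTextObject = false;
            }
        }
        return !insideTextObject; // false if a BT was never closed
    }

    public static void main(String[] args) {
        byte[] sample = "BT /F1 12 Tf (Hi) Tj ET\0\0\0".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("trailing zero bytes: " + countTrailingZeroBytes(sample));
        System.out.println("BT/ET balanced:      " + hasBalancedTextObjects(sample));
    }
}
```

A real checker would need a proper PDF tokenizer; this only narrows down where to look before comparing the stream with the specification by hand.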

The bug the OP found can explain the issue:

in the Buffer.java:clone() (line 217)

instead of line:

clone.append(data);

needs to be:

clone.append(data, 0, this.length);

Without this correction it clones the whole data buffer, and sets the cloned Buffer's length to the data[].length. This is very problematic if the Buffer.length is smaller than the data[].length.

Trailing 0 bytes can be an effect of such a buffer copying bug.

Furthermore, symptoms like those found by the OP (after its end, the stream had a copy of a past instance of it) can also be the effect of such a bug. So I assume the OP found those symptoms on a different page, not page 3, but fixing the bug healed all symptoms.

How can I find what's wrong with it? is there a tool which tells you where does the error lie?

There are PDF syntax checkers, e.g. the Preflight tool included in Adobe Acrobat, but even that fails on your file.

So essentially you have to extract the page content (using a PDF browser, e.g. RUPS) and check it manually against the PDF specification, open on the other screen.

Zacatecas answered 16/9, 2013 at 14:21 Comment(18)
@Bruno, thank you so much for your efforts to help me! I'm trying to study this a bit to understand everything you said :). Once I understand what caused the problem, I'll post it. - Szombathely
It was mkl who helped you. I upvoted his answer and added a link to a short blog post about RUPS. I don't (want to) know PdfClown, but what mkl is telling you is that you're creating PDF syntax that is illegal according to ISO-32000-1. You (or PdfClown) are mixing text operators and graphics operators, breaking the rules of the specification, resulting in a PDF that is so broken that even Acrobat can't fix it. - Steppe
I solved the problem (if interested, see my answer). The problem wasn't those Tf's outside a text block. Those Tf's actually set the default fonts' details (I'm not saying it's allowed by the PDF specification, but that's what it actually does...). I thank you all once again. - Szombathely
"I'm not saying it's allowed by the PDF specification - but that's what it actually does..." - it may do that in current versions of some PDF viewers, but counting on that behavior still being there in the next version is somewhat risky. - Zacatecas
Please, @BrunoLowagie and mkl, are you sure that those text operators are illegal? According to PDF 1.7 (and its derivative ISO-32000-1), text state operators (like the above-mentioned Tf and Tw) are perfectly legal at page description level (that is, OUTSIDE text objects)! I don't (want to) know who Bruno Lowagie is, but I would have expected him to endorse a correct answer, avoiding lazy assumptions that cast discredit on the quality of others' projects. Thank you! - Sulphathiazole
Don't worry @user1028741, text state operators like Tf, despite what Bruno Lowagie said, are absolutely LEGAL outside text objects. ;-) - Sulphathiazole
@Stefano yes, you are right. I had often seen text positioning or text drawing operations used outside text objects, and path construction operations etc. used inside them, and so assumed such an issue too early. I found out myself sometime early this year that text state operations are legal outside text objects, too. - Zacatecas
@Stefano Chizzolini: if you are developing PDF software, please join the ISO committee for PDF so that you're up to date. Another reason: Adobe owns most of the patents with respect to PDF, but grants everyone who respects the specs a license to use those patents. That also means that whoever doesn't respect the specs may be infringing a patent. Being a member of the ISO committee gives you access to the people who write the specs, so it is in your interest to join. - Steppe
@Bruno "I am being proactive and referring to ISO-32000-2." - the latest draft I saw, 2014-0220, still allowed text state operators in text objects. Has it changed that much in between? - Zacatecas
@BrunoLowagie "I am being proactive and referring to ISO-32000-2". - No way, sorry: you were explicitly referring to ISO-32000-1, NOT ISO-32000-2! Anyway, referring to ISO-32000-2 would be pointless, as we were just reasoning about compatibility with the current spec. - Sulphathiazole
@Zacatecas No problem, I appreciate your honest intention; I would just ask you to amend your original answer, removing the passage about the wrong text operators (other users may still interpret it the wrong way). Thank you! - Sulphathiazole
Many things are in flux right now. Edinburgh will bring some important changes (not necessarily regarding Tf), which means the spec isn't to be expected before 2016. I don't understand the fuss Stefano Chizzolini is making, though. - Steppe
I've updated mkl's answer. Based on the fact that the OP was able to fix the problem by rearranging text operators, there must have been some significant problem with those operators, as indicated by mkl. - Steppe
@BrunoLowagie I was irritated that your comments lacked technical accuracy (the supposed illegality of text state operators outside text objects according to ISO-32000-1; ISO-32000-2 is off-topic in this context), especially considering that your initial comment expressed a sort of disdain for my project. I am happy to solve possible issues regarding my library, but I expect fairness and respect (constructive comments, NOT destructive ones!). Thank you. - Sulphathiazole
I honestly don't know PdfClown and I don't need to know it because I wrote my own PDF library in 2000. iText was the first PDF library that was capable of being used in a web context. Many other developers tried to copy iText's success. Not many developers also wrote a book, vetted the IP of their code, created a business model to ensure sustained support and to make their product future-proof (e.g. by being part of the ISO committees that write the specs). Those are facts, stripped of emotions such as disdain. - Steppe
@BrunoLowagie It was exactly because of your role in the IT community that I couldn't comprehend the quality of your contribution in this thread. Anyway, that's it! - Sulphathiazole
@StefanoChizzolini "I would just ask you to amend your original answer" - I updated the answer, and the bug the OP found might indeed be the cause of the more than 2000 zero bytes at the end of the page 3 content stream. - Zacatecas
Thank you, mkl, for your accurate summary; it's a perfect clarification. - Sulphathiazole
Answer (score 1)

The general post about debugging PDFs might also have been helpful, as RUPS, pdfstreamdump, etc. are mentioned there: How do you debug PDF files?

Spaghetti answered 3/9, 2014 at 13:39 Comment(0)
