Are byte order marks allowed in PDF documents?

I'm having an issue with a filter program I wrote. It detects whether a file is a PDF document by reading the first 5 bytes of the file and comparing them to a fixed buffer:

25 50 44 46 2D

This works fine, except that I'm seeing a few files that start with a byte order mark instead:

EF BB BF 25 50 44 46 2D
^------^
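As a minimal sketch of the check described above (the filter's real language and structure are not given, so the Python names `PDF_MAGIC`, `UTF8_BOM`, and `classify_header` are purely illustrative), the two byte patterns can be told apart like this:

```python
PDF_MAGIC = b"%PDF-"        # 25 50 44 46 2D
UTF8_BOM = b"\xef\xbb\xbf"  # EF BB BF


def classify_header(head: bytes) -> str:
    """Classify the leading bytes of a file (hypothetical helper).

    Returns "pdf" when the data starts directly with the magic bytes,
    "bom+pdf" when a UTF-8 byte order mark precedes them,
    and "other" for anything else.
    """
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(UTF8_BOM + PDF_MAGIC):
        return "bom+pdf"
    return "other"
```

Reporting the BOM case separately, rather than silently accepting it, leaves the decision about whether to let such files through to the caller.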

I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.

Yet I see these documents in the wild, and users get confused because PDF reader programs can open these documents but my filter rejects them.

So: are BOM markers allowed at the start of PDF documents? (I'm NOT talking about string objects here, but about the PDF file itself.)

Goldman answered 15/10, 2015 at 15:30 Comment(0)

So: are BOM markers allowed at the start of PDF documents?

No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.

But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.

Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:

3.4.1, “File Header”

  1. Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.

  2. Acrobat viewers also accept a header of the form

    %!PS-Adobe-N.n PDF-M.m

...

3.4.4, “File Trailer”

  1. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

And because people tend to think that any PDF that Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
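A filter that wants to mimic this tolerance could search a window instead of comparing at offset 0. The following is only a sketch of the two Acrobat notes quoted above: the function names are made up for illustration, and it assumes the whole marker must lie inside the 1024-byte window, which the notes do not spell out.

```python
from typing import Optional

WINDOW = 1024  # window size taken from the implementation notes quoted above


def find_pdf_header(data: bytes, window: int = WINDOW) -> Optional[int]:
    """Return the offset of b"%PDF-" within the first `window` bytes, or None."""
    idx = data.find(b"%PDF-", 0, window)
    return idx if idx != -1 else None


def has_eof_marker(data: bytes, window: int = WINDOW) -> bool:
    """Return True if b"%%EOF" occurs within the last `window` bytes."""
    start = max(0, len(data) - window)
    return data.rfind(b"%%EOF", start) != -1


def looks_like_pdf_lenient(data: bytes) -> bool:
    """Accept data that an Acrobat-style tolerant reader would probably open."""
    return find_pdf_header(data) is not None and has_eof_marker(data)
```

A non-zero offset from `find_pdf_header` means the file has leading bytes that the specification does not allow, so a filter can accept the file and still log that it is non-conforming.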

Squirt answered 15/10, 2015 at 16:11 Comment(0)

No, a BOM is not valid at the front of a PDF file.

A PDF is a binary file format, so a BOM wouldn't actually make sense. It would be like having a BOM at the front of a ZIP file or a JPEG.

I'm guessing the PDFs that you are consuming are coming from misconfigured applications that either have something already at the front of their output buffer or, more likely, are created with the incorrect assumption that a PDF is a text-based format.
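To illustrate how the "text-based format" assumption could produce exactly those bytes (this is a guess at the mechanism, not a claim about what any particular producer does), here is what happens in Python when the header is written through a BOM-emitting text encoding instead of as raw bytes:

```python
import codecs

header_text = "%PDF-1.4\n"

# Encoding as text with Python's "utf-8-sig" codec prepends a UTF-8 BOM.
as_text = header_text.encode("utf-8-sig")

# Encoding the same header as plain ASCII bytes does not.
as_bytes = header_text.encode("ascii")

assert as_text.startswith(codecs.BOM_UTF8)
assert as_bytes.startswith(b"%PDF-")

print(as_text[:8].hex(" ").upper())  # EF BB BF 25 50 44 46 2D
```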

Aoudad answered 15/10, 2015 at 16:8 Comment(4)
Your last paragraph is actually not correct. Many applications specifically added binary data at the front of PDF files in order to force file transfer protocols to handle the file as binary and not break the PDF file by mistreating line endings between platforms. As Adobe Acrobat has always handled this correctly (and thus other PDF readers out of necessity as well), it wasn't a big deal. - Consumer
We might be splitting hairs, but I still stand by that statement. The spec actually recommends that, after the ASCII version header, authors include a comment line with four binary characters to force a binary transfer if their PDF contains binary data (which most do these days). That's not a BOM at the beginning of the file, which is what the OP asked about. (It's not really a BOM in any way, actually.) Also, in my 15+ years of web development I've never put junk data in front of any binary file to force it to download; there's a dedicated HTTP header for that. - Aoudad
I'm not saying you did it :) But it was commonly done. I've written PDF preflight software, and PDF files with a bunch of junk at the front (not a BOM, of course) were very common. And it was not done by faulty software but very deliberately. - Consumer
I generally count humans in the "faulty software" category ;) - Aoudad
