Let's see... [switching into PDF debugging mode].
First, I unpack your full_template.pdf with the help of qpdf, a command-line utility "that does structural, content-preserving transformations on PDF files" (self-description):
qpdf --qdf full_template.pdf qdf---test.pdf
The result, qdf---test.pdf, is now easier to analyse in a normal text editor, because all streams are unpacked.
Searching for the string "est" finds us this line:
[(T) 120 (est)] TJ
Poking around a bit more (and looking at qpdf's very helpful comments sprinkled into its output!) we find this: the PDF object where your mirrored string "Test" appears in the original PDF is number 22. It is a completely separate object from the rest of the file's text, and it is also the only one that uses an un-embedded Helvetica font.
So let's extract that separately from the original file:
qpdf --show-object=22 --filtered-stream-data full_template.pdf
q
/DeviceRGB cs
0.000 0.000 0.000 scn
/DeviceRGB CS
0.000 0.000 0.000 SCN
1 w
0 J
0 j
[ ] 0 d
BT
286.55 797.384 Td
/F3.0 12 Tf
[<54> 120 <657374>] TJ
ET
Q
OK, here the piece [(T) 120 (est)] TJ appears as [<54> 120 <657374>] TJ. We verify this with the help of the ascii command, which prints a nice ASCII <-> Hex table. That table confirms:
T 54
e 65
s 73
t 74
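If the ascii command is not at hand, the same check can be sketched in a few lines of Python. The function name decode_tj_hex is made up for this illustration, and the sketch assumes a font with a plain single-byte encoding (as with the Helvetica in object 22); the two-byte codes used elsewhere in the file would not decode to readable text this way.

```python
import re

def decode_tj_hex(tj_array: str) -> str:
    """Join the hex strings (<...>) of a TJ array, skipping the kerning numbers.

    Assumes one byte per character and an even number of hex digits
    in every string; bytes.fromhex() raises ValueError otherwise.
    """
    hex_parts = re.findall(r"<([0-9A-Fa-f\s]*)>", tj_array)
    return "".join(bytes.fromhex(part).decode("latin-1") for part in hex_parts)

print(decode_tj_hex("[<54> 120 <657374>] TJ"))  # prints: Test
```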
What do the other operators mean? We look them up in the official ISO 32000 PDF-1.7 spec, Annex A, "Operator Summary". Here we find the following bits of info:
q : gsave
Q : grestore
cs : setcolorspace for nonstroking ops
CS : setcolorspace for stroking ops
scn : setcolor for nonstroking ops
SCN : setcolor for stroking ops
w : setlinewidth
j : setlinejoin
J : setlinecap
d : setdash
BT : begin text object
Td : move text position
Tf : set text font and size
TJ : show text allowing individual glyph positioning
Tj : show text
ET : end text object
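For longer streams, the lookup can be done mechanically. The following is only a rough sketch: the table holds just the operators listed above, the name annotate is invented for this example, and it assumes one operator per line with the operator as the last token, which is how the stream shown above happens to be laid out.

```python
# Operator meanings copied from the list above (Annex A of the PDF spec).
OPERATORS = {
    "q": "gsave", "Q": "grestore",
    "cs": "setcolorspace for nonstroking ops",
    "CS": "setcolorspace for stroking ops",
    "scn": "setcolor for nonstroking ops",
    "SCN": "setcolor for stroking ops",
    "w": "setlinewidth", "j": "setlinejoin", "J": "setlinecap", "d": "setdash",
    "BT": "begin text object", "Td": "move text position",
    "Tf": "set text font and size",
    "TJ": "show text allowing individual glyph positioning",
    "Tj": "show text", "ET": "end text object",
}

def annotate(stream: str) -> list[str]:
    """Append the meaning of each line's trailing operator as a PDF comment."""
    annotated = []
    for line in stream.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        meaning = OPERATORS.get(tokens[-1], "(not in table)")
        annotated.append(f"{line}  % {meaning}")
    return annotated

for line in annotate("BT\n286.55 797.384 Td\n/F3.0 12 Tf\nET"):
    print(line)
```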
Nothing suspicious so far...
However, looking at the other object, number 5, which holds the original page content, we discover a difference. For example:
1 0 0 -1 -17.2308 -13.485 Tm
<0013001c001200130018001200140015> Tj
Here, before each single Tj (show text) action, the Tm operator (What is this?!?) is in play. Let's also look up Tm in the PDF spec:
Tm : set text matrix and text line matrix
What is strange, however, is that this matrix uses 1 0 0 -1 (instead of the more common 1 0 0 1). This leads to upside-down mirroring of the text.
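To see why 1 0 0 -1 flips things, here is a small sketch of how a PDF matrix a b c d e f maps a point: x' = a*x + c*y + e and y' = b*x + d*y + f. The helper name apply_matrix is made up for this illustration, and the translation part is set to 0 0 to keep the numbers simple.

```python
def apply_matrix(m, point):
    """Map (x, y) through a PDF matrix given as the six numbers a b c d e f."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

common = (1, 0, 0, 1, 0, 0)    # the usual, non-mirroring matrix
flipped = (1, 0, 0, -1, 0, 0)  # the variant found in object 5

top_of_glyph = (0, 10)  # a point 10 units above the baseline
print(apply_matrix(common, top_of_glyph))   # prints: (0, 10)
print(apply_matrix(flipped, top_of_glyph))  # prints: (0, -10)
```

The x coordinate is untouched while y changes sign, so whatever sat above the baseline ends up below it.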
Wait a minute!?!
The original text content is stroked with a mirroring text matrix, but still appears normal?? And your added text doesn't use any text matrix of its own, yet appears mirrored? What is going on?!
I'm not going to trace it down any further for now. My assumption, however, is that somewhere in the guts of the original PDF, the authoring software defined an 'extended graphics state' which causes all stroking operations to be mirrored by default.
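Here is a sketch of what that assumption would imply. It only illustrates the arithmetic and was not traced in the actual file: the "outer" flip below is hypothetical, and the helper name compose is invented for this example. Two vertical flips cancel each other, while unflipped text inside a flipped environment comes out mirrored.

```python
def compose(m1, m2):
    """Return the PDF matrix that applies m1 first and then m2.

    Both are six-number matrices a b c d e f, following the
    row-vector convention [x y 1] x M used by the PDF spec.
    """
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )

FLIP = (1, 0, 0, -1, 0, 0)
IDENTITY = (1, 0, 0, 1, 0, 0)

outer = FLIP  # hypothetical mirroring set up somewhere by the authoring software

# Original page text: its Tm flips, the outer flip undoes it, so it looks normal.
print(compose(FLIP, outer))      # prints: (1, 0, 0, 1, 0, 0)

# Added text without a flipping Tm: only the outer flip applies, so it is mirrored.
print(compose(IDENTITY, outer))  # prints: (1, 0, 0, -1, 0, 0)
```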
It seems you've done nothing wrong, Sebastien -- you've just been unlucky with your choice of a test object, and got blessed with a rather weird one. Try to continue your 'Prawn' experiments with some other PDFs first...
One can "fix" your full_template.pdf by replacing this line in qdf---test.pdf:
286.55 797.384 Td
by this one:
1 0 0 -1 286.55 797.384 Tm
and then running a last qpdf command to fix the (now corrupted by our editing) PDF cross-reference table and stream lengths:
qpdf qdf---test.pdf full_template---fixed.pdf
The console output will show you what it does:
WARNING: qdf---test.pdf: file is damaged
WARNING: qdf---test.pdf (file position 151169): xref not found
WARNING: qdf---test.pdf: Attempting to reconstruct cross-reference table
WARNING: qdf---test.pdf (object 8 0, file position 9072): attempting to recover stream length
qpdf: operation succeeded with warnings; resulting file may have some problems
The "fixed" PDF will show the text un-mirrored.
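If you would rather not edit the QDF file by hand, the replacement step can be scripted. This is a minimal sketch: the function names patch_content and patch_file are made up for this example, and it assumes the Td line occurs exactly once in the file (it refuses to touch anything otherwise). The file is handled as bytes because a PDF may contain binary data.

```python
from pathlib import Path

OLD = b"286.55 797.384 Td"
NEW = b"1 0 0 -1 286.55 797.384 Tm"

def patch_content(data: bytes) -> bytes:
    """Swap the Td line for a Tm line that carries the flipping matrix."""
    count = data.count(OLD)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of the Td line, found {count}")
    return data.replace(OLD, NEW)

def patch_file(path: str) -> None:
    """Rewrite the given QDF file in place with the patched content."""
    target = Path(path)
    target.write_bytes(patch_content(target.read_bytes()))
```

Calling patch_file("qdf---test.pdf") takes the place of the manual edit; the final qpdf run shown above is still needed afterwards to repair the cross-reference table and stream lengths.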
sample.pdf is somehow corrupt. – Beefeater
sample.pdf isn't "corrupt" at all. It's perfectly "legal" PDF source code. It's just... "weird" in the way that code is written. See the answers below. – Bathy