PdfContentStreamEditor rotating image on PDF file

I have what I hope is an easy question. I'm trying to use iTextSharp to modify some PDF files, but the XMP metadata that iTextSharp puts at the end of the files seems to be ruining their layout (and I'm not conversant enough in the PDF format to understand why).

Here's a small section of the original document, and the same section from the 'edited' document. You can see from the two images that the document appears to have been rotated. Comparing the two files byte by byte, however, the only difference appears to be some XMP metadata at the end of the file.

DIFF of files showing XMP metadata at end as only difference

I've tried opening the files in several PDF viewers (Sumatra PDF, Edge Browser and Adobe Acrobat) and all show the same weirdness.

I guess I have two questions: a) How can the PDF file be so altered just by having XMP metadata at the end of the file? b) How can I make iTextSharp not produce this output? (iTextSharp only seems to do this when I add or edit content, not when I just strip out JavaScript or similar.)

<EDIT 1>
The code that I'm using for the iTextSharp is the PdfContentStreamEditor (verbatim) from the post here: https://mcmap.net/q/1469305/-removing-watermark-from-pdf-itextsharp
</EDIT 1>
<EDIT 2>
Ok.. it seems that it's not the XMP Metadata. I got rid of that by using:

pdfStamper.XmpMetadata = new byte[0];

However there is still a bunch of extra data placed at the end of the file

2 0 obj
<</Producer(PDFCreator 2.5.2.5233; modified using iTextSharp’ 5.5.13 ©2000-2018 iText Group NV \(AGPL-version\))/CreationDate(D:20171206173510+10'30')/ModDate(D:20180325144710+11'00')/Title(þÿ
endobj
404 0 obj
<</Length 0/Type/Metadata/Subtype/XML>>stream

endstream
endobj
405 0 obj
<</Length 3638/Filter/FlateDecode>>stream
xœÍZmÅ/6ÒZ2ÁÆ€
....

</EDIT 2>

Carafe answered 25/3, 2018 at 3:23 Comment(10)
Probably there is an issue with my PdfContentStreamEditor class. To verify I'd need the PDF in question, though. - Ivelisseivens
I have another PDF that also seems to show 'weirdness' when put through the code. I can send this instead, since it doesn't contain any of our privileged corporate info. How best to send to you? I did have a look through the Adobe PDF spec, because I was surprised by the Write method putting a space / newline into the output (I was expecting a full 1:1 write through)... but it seemed valid (albeit, as noted, I don't know anything about the PDF format). - Carafe
If there is no privileged info in it anymore, you can simply share the file by means of e.g. a public Google Drive or Dropbox share. - Ivelisseivens
Here's both an original, and one that has been through the PdfContentStreamEditor (without any editing supposed to have been performed). I only did the EditContent call on the first page, so the other pages are still healthy. drive.google.com/open?id=1KSXgoPgkUX9atCPQXDcx86T30xLBgJYJ - Carafe
The file you shared appears to have a feature that covers pages with a note under some circumstances. Such features can be quite sensitive to document changes. I'll try and understand that feature better sometime in the next few days. - Ivelisseivens
By the way, you should change the title of your question, as that obviously is no longer what you are trying to do. - Ivelisseivens
Ok, I can reproduce the scrambling of the text on the first page... in contrast to the original code, though, I had to use append mode for that. As far as I can see now, the cause has to do with the password protection of the document (it is encrypted using the default password, so one does not have to enter a password but it is encrypted nonetheless, which is why Adobe Reader shows "(SECURED)" thereafter). I'll look into that. - Ivelisseivens
I created an answer for the issue with this example file. The problem of the rotated contents surely is a different matter, though. If possible, also share that file, please. - Ivelisseivens
I've added a set of revised files to the same Google Drive share as before; they are generated from PHA-Pro and Cute PDF Writer... I suspect that it's an issue with page rotation, as the entire document is landscape, whilst the resultant page seems to have the content rotated to be portrait (but still on a landscape document layout). - Carafe
I added a section to my answer which explains the rotation and also how to prevent it. - Ivelisseivens

You have indeed found a bug in the PdfContentStreamEditor I used in this answer. The other issue requires one to know how to disable a special feature or quirk (depending on the circumstances) of iText.

Rotation of the content

This part deals with the rotation of content in the sample document PHA-Pro 8 - File.pdf provided by the OP.

As you already have seen yourself, the rotation issue appears connected with the fact that the page rotation of the page in question is not 0.

Indeed, the iText PdfStamper has a feature which, for rotated pages, automatically rotates whatever one adds to the OverContent or UnderContent. This is handy if you want to add upright content to the page without applying the rotation yourself. In the case of the PdfContentStreamEditor, though, all coordinates we receive from the existing content already have the applicable rotation factored in.
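
To see why that automatic rotation hurts here, consider the following minimal sketch. It assumes that, for a page with /Rotate 90, the stamper prepends a matrix of the form "0 1 -1 0 top 0 cm" to the added content. That matrix shape, the class name RotationSketch and the 595 x 842 page size used below are assumptions for illustration only; they are not taken from the PdfContentStreamEditor code.

```csharp
using System;

public static class RotationSketch
{
    // Applies a PDF-style matrix [a b c d e f] to a point:
    // x' = a*x + c*y + e, y' = b*x + d*y + f
    public static (double X, double Y) Apply(double[] m, double x, double y)
        => (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]);

    // ASSUMPTION: shape of the matrix prepended for a page with /Rotate 90,
    // where rotatedPageTop is the height of the page as displayed.
    public static double[] Rotate90(double rotatedPageTop)
        => new[] { 0.0, 1.0, -1.0, 0.0, rotatedPageTop, 0.0 };
}
```

With a 595 x 842 MediaBox and /Rotate 90 the displayed page is 842 wide and 595 high. A point taken from the existing content, say (100, 700), is already expressed in the unrotated coordinate system of the page. RotationSketch.Apply(RotationSketch.Rotate90(595), 100, 700) moves it to (-105, 100), which is outside the MediaBox. Under these assumptions, content that was already positioned correctly gets rotated a second time.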

Thus, we need to disable this feature. One can do so using the PdfStamper property RotateContents:

using (PdfReader pdfReader = new PdfReader(source))
// (char)0 keeps the original PDF version; true opens the stamper in append mode
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(dest, FileMode.Create, FileAccess.Write), (char)0, true))
{
    // do not apply the page rotation again to the rewritten content
    pdfStamper.RotateContents = false;
    PdfContentStreamEditor editor = new PdfContentStreamEditor();

    // page numbers are 1-based
    for (int i = 1; i <= pdfReader.NumberOfPages; i++)
    {
        editor.EditPage(pdfStamper, i);
    }
}

Scrambling of text

This part deals with the scrambling of text in the sample document AS62061-2006.pdf provided by the OP.

You have found a bug in the PdfContentStreamEditor. Its Write method contains this loop:

foreach (PdfObject pdfObject in operands)
{
    pdfObject.ToPdf(canvas.PdfWriter, canvas.InternalBuffer);
    canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}

It should instead be

foreach (PdfObject pdfObject in operands)
{
    pdfObject.ToPdf(null, canvas.InternalBuffer);
    canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}

If one passes the PdfWriter to the ToPdf method of a PdfString and the PdfWriter uses encryption, the string contents get encrypted. But here the string is written into a content stream, and in that case the individual string must not be encrypted; instead, the whole stream is encrypted later on.

This applies to the PDF provided by the OP because

  • the PDF is encrypted using the default password and
  • the OP edited using a PdfStamper in append mode which encrypts the additions using the same password as the original file.

With the original code, the result looks like this:

broken page content

With the fixed code, it looks like this:

proper page content
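
The effect of encrypting a string inside a stream that is later encrypted as a whole can be reproduced with a toy model. The sketch below is not iText code: it uses a single RC4 key for everything (real PDF encryption derives keys per object and may use AES), writes the string in hex notation, and the names EncryptionSketch, Rc4 and StringSeenByReader are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncryptionSketch
{
    // Hypothetical key; a real PDF derives a key per object.
    static readonly byte[] Key = Encoding.ASCII.GetBytes("Key");

    // Plain RC4; encryption and decryption are the same operation.
    public static byte[] Rc4(byte[] key, byte[] data)
    {
        var s = new byte[256];
        for (int i = 0; i < 256; i++) s[i] = (byte)i;
        for (int i = 0, j = 0; i < 256; i++)
        {
            j = (j + s[i] + key[i % key.Length]) & 0xFF;
            (s[i], s[j]) = (s[j], s[i]);
        }
        var output = new byte[data.Length];
        for (int n = 0, a = 0, b = 0; n < data.Length; n++)
        {
            a = (a + 1) & 0xFF;
            b = (b + s[a]) & 0xFF;
            (s[a], s[b]) = (s[b], s[a]);
            output[n] = (byte)(data[n] ^ s[(s[a] + s[b]) & 0xFF]);
        }
        return output;
    }

    static string ToHex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", "");

    // Writes "<hex> Tj" into an encrypted stream and returns the hex of the
    // string bytes a reader sees after decrypting the stream exactly once.
    public static string StringSeenByReader(string text, bool alsoEncryptString)
    {
        byte[] stringBytes = Encoding.ASCII.GetBytes(text);
        if (alsoEncryptString)
            stringBytes = Rc4(Key, stringBytes);   // the extra, unwanted step

        byte[] content = Encoding.ASCII.GetBytes("<" + ToHex(stringBytes) + "> Tj\n");
        byte[] stored = Rc4(Key, content);         // writer encrypts the whole stream
        string decoded = Encoding.ASCII.GetString(Rc4(Key, stored)); // reader decrypts it
        return decoded.Substring(1, decoded.IndexOf('>') - 1);
    }
}
```

StringSeenByReader("Hello", false) returns the hex form of the original text, while StringSeenByReader("Hello", true) returns different bytes. A reader only decrypts the stream and never the strings inside it, so in this model the extra string encryption survives as scrambled text.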

Ivelisseivens answered 25/3, 2018 at 21:14 Comment(0)

I can answer your second question. The metadata you are trying to remove is not supposed to be removed. The DLL of the AGPL version that you are using will add that metadata, no matter what you do in code. You will not be able to remove it with iText, as doing so is a direct violation of their licence terms. Please refer to: https://itextpdf.com/AGPL

You must prominently mention iText and include the iText copyright and AGPL license in output file metadata.

Driskell answered 25/3, 2018 at 4:10 Comment(9)
OK, is it possible that this iText / AGPL metadata is causing the display issues that I'm seeing? It does seem to end with a bunch of stuff like standard PDF elements... <</Type/Page/MediaBox[0 0 595 842]/Rotate 90/Parent 3 0 R/Resources<</ProcSet[/PDF/ImageC/Text]/ExtGState<</R7 7 0 R>>/XObject<</R16 16 0 R/R11 11 0 R/R10 10 0 R>>/Font<</R12 12 0 R/R14 14 0 R/R8 8 0 R>>>>/Contents[405 0 R]>> endobj xref ..... trailer <</Size 406/Root 1 0 R/Info 2 0 R/ID [<37cc5304dd427b4bc45682ed176e8b90><994a08ca7d4b4a1737acb7b2820c620c>]/Prev 2532558>> %iText-5.5.13 startxref 2545243 %%EOF - Carafe
Metadata never changes the visual representation of a PDF. However, I read that (1) you don't know ISO 32000 that well, but (2) you are editing content streams. That is a contradiction. That's like saying to the PDF: I'm not a surgeon nor a brain specialist, but I'm going to do brain surgery on you. If you want to edit content streams, you need to know what you're doing. - Bosom
@bruno Perhaps I'm wrong then... and there's more than just metadata involved. An original and damaged PDF can be found here: drive.google.com/open?id=1KSXgoPgkUX9atCPQXDcx86T30xLBgJYJ - Carafe
Because you use iText in an AGPL context, where can we see your entire code? Somewhere on GitHub maybe? I mean, your files are obviously proprietary, but your code isn't. (because AGPL) - Incomputable
@Amadee-van-gasse, my entire code consists of one Windows Form, one CS file from mkl's stackoverflow post and the itextsharp nuget package. It has not been distributed or made available for use by anyone outside of me.. it also currently doesn't do anything beside spit out a wonky page 1. As you know from the AGPL, if it's not being made available for use by others or distributed then the modified work source code does not need to be released... However if I do get something working, then I will put it on github. If not, I will bin it, and do something else. - Carafe
@AmedeeVanGasse since things are now somewhat working, you'll be pleased to know that I've put the source on github: github.com/bevanweiss/PdfEditor. Really not much to go on... still not sure what's causing the rotation on landscape pages, but it might be useful for some people... It seems to address what like 90% of posts are about, replacing text in a PDF. I realise it has huge limitations, but 'it works for me'. The auto-redacting feature is already coming in handy for me also... - Carafe
@Carafe For the redaction part I'd propose using the PdfCleanUp classes from the iTextSharp Extra package as they (to a certain degree) do remove the redacted content. The iText 7 pdfSweep module is based thereupon. - Ivelisseivens
@Ivelisseivens thanks again :) Yeah, the redaction part could be done more robustly for sure. I guess the best way would be a combination, if it's a Tj element found, then convert it to a TJ, remove the redacted text string and put a shift in the direction of the missing text, then once the text is removed put the black bar overlay to indicate that it has been redacted. - Carafe
@Carafe Actually a generic solution requires quite a lot more. Do have a look at the PdfCleanUp stuff, it is not perfect but already does consider a lot of stuff. - Ivelisseivens
