PDF Compression with iTextSharp [closed]

I am trying to recompress an existing PDF. Specifically, I want to recompress the images in the document to reduce the file size.

I have been trying to do this with the DataLogics PDE and iTextSharp libraries, but I cannot find a way to recompress the streams.

I have thought about looping over the XObjects, extracting the images, and then either dropping the DPI to 96 or using the libjpeg C# implementation to lower the image quality. However, getting the result back into the PDF stream always seems to end in memory corruption or some other issue.

Any samples will be appreciated.

Thanks

Tully answered 5/1, 2012 at 9:35 Comment(5)
See https://mcmap.net/q/181516/-pdftk-compression-option (you could also use ImageMagick). - Bahia
As it's .NET, @Bahia is probably talking about imagemagick.codeplex.com - Associate
@Associate The problem is not resampling the images, but actually getting the images back into the PDF stream. You cannot save the images to disk, as that will create issues with transparency etc. - Tully
But you can save it into a MemoryStream and then put it back into the document, by appending it or adding a page with that stream, correct? - Associate
@Associate Well yes, but I need to re-inject it into the stream and not work with the bitmaps. - Tully

iText and iTextSharp have some methods for replacing indirect objects. Specifically there's PdfReader.KillIndirect() which does what it says and PdfWriter.AddDirectImageSimple(iTextSharp.text.Image, PRIndirectReference) which you can then use to replace what you killed off.

In pseudo C# code you'd do:

//obj is the indirect reference to the image, e.g. taken from a page's XObject dictionary
var oldImage = PdfReader.GetPdfObject(obj);
var newImage = YourImageCompressionFunction(oldImage);
PdfReader.KillIndirect(obj);
yourPdfWriter.AddDirectImageSimple(newImage, (PRIndirectReference)obj);

Converting the raw bytes to a .Net image can be tricky. I'll leave that up to you, or you can search here. Mark has a good description here. Also, technically PDFs don't have a concept of DPI; that's mostly for printers. See the answer here for more on that.

Using the method above, your compression algorithm can actually do two things: physically shrink the image and apply JPEG compression. When you physically shrink the image and add it back, it occupies the same amount of space on the page as the original image but with fewer pixels to work with. This gets you what you consider to be DPI reduction. The JPEG compression speaks for itself.
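To make the "same space, fewer pixels" point concrete, here is a minimal sketch of the arithmetic. The class and method names and the sample numbers are illustrative assumptions, not part of iTextSharp. It relies only on the fact that default PDF user-space units are 1/72 inch.

```csharp
using System;

static class DpiSketch {
    // Default PDF user-space units are 1/72 inch.
    const double PointsPerInch = 72.0;

    // Effective resolution of an image that is pixelWidth pixels wide
    // and drawn into a box placedWidthPoints wide on the page.
    public static double EffectiveDpi(int pixelWidth, double placedWidthPoints) {
        double placedWidthInches = placedWidthPoints / PointsPerInch;
        return pixelWidth / placedWidthInches;
    }
}
```

For example, a 3000-pixel-wide image placed 720 points (10 inches) wide gives DpiSketch.EffectiveDpi(3000, 720) = 300. After shrinking it to 2700 pixels and putting it back in the same box, the result is 270, even though nothing in the PDF ever mentions DPI.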

Below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It takes an existing JPEG on your desktop called "LargeImage.jpg" and creates a new PDF from it. Then it opens the PDF, extracts the image, physically shrinks it to 90% of the original size, applies 85% JPEG compression and writes it back to the PDF. See the comments in the code for more of an explanation. The code needs lots more null/error checking. Also look for NOTE comments where you'll need to expand to handle other situations.

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Drawing.Drawing2D;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication1 {
    public partial class Form1 : Form {
        public Form1() {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e) {
            //Our working folder
            string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
            //Large image to add to sample PDF
            string largeImage = Path.Combine(workingFolder, "LargeImage.jpg");
            //Name of large PDF to create
            string largePDF = Path.Combine(workingFolder, "Large.pdf");
            //Name of compressed PDF to create
            string smallPDF = Path.Combine(workingFolder, "Small.pdf");

            //Create a sample PDF containing our large image, for demo purposes only, nothing special here
            using (FileStream fs = new FileStream(largePDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
                using (Document doc = new Document()) {
                    using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                        doc.Open();

                        iTextSharp.text.Image importImage = iTextSharp.text.Image.GetInstance(largeImage);
                        doc.SetPageSize(new iTextSharp.text.Rectangle(0, 0, importImage.Width, importImage.Height));
                        doc.SetMargins(0, 0, 0, 0);
                        doc.NewPage();
                        doc.Add(importImage);

                        doc.Close();
                    }
                }
            }

            //Now we're going to open the above PDF and compress things

            //Bind a reader to our large PDF
            PdfReader reader = new PdfReader(largePDF);
            //Create our output PDF
            using (FileStream fs = new FileStream(smallPDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
                //Bind a stamper to the file and our reader
                using (PdfStamper stamper = new PdfStamper(reader, fs)) {
                    //NOTE: This code only deals with page 1, you'd want to loop more for your code
                    //Get page 1
                    PdfDictionary page = reader.GetPageN(1);
                    //Get the xobject structure
                    PdfDictionary resources = (PdfDictionary)PdfReader.GetPdfObject(page.Get(PdfName.RESOURCES));
                    PdfDictionary xobject = (PdfDictionary)PdfReader.GetPdfObject(resources.Get(PdfName.XOBJECT));
                    if (xobject != null) {
                        PdfObject obj;
                        //Loop through each key
                        foreach (PdfName name in xobject.Keys) {
                            obj = xobject.Get(name);
                            if (obj.IsIndirect()) {
                                //Get the current key as a PDF object
                                PdfDictionary imgObject = (PdfDictionary)PdfReader.GetPdfObject(obj);
                                //See if it's an image (comparing from the constant side avoids a null reference if the key is missing)
                                if (PdfName.IMAGE.Equals(imgObject.Get(PdfName.SUBTYPE))) {
                                    //NOTE: There's a bunch of different types of filters, I'm only handling the simplest one here which is basically raw JPG, you'll have to research others
                                    if (PdfName.DCTDECODE.Equals(imgObject.Get(PdfName.FILTER))) {
                                        //Get the raw bytes of the current image
                                        byte[] oldBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObject);
                                        //Will hold bytes of the compressed image later
                                        byte[] newBytes;
                                        //Wrap a stream around our original image
                                        using (MemoryStream sourceMS = new MemoryStream(oldBytes)) {
                                            //Convert the bytes into a .Net image
                                            using (System.Drawing.Image oldImage = Bitmap.FromStream(sourceMS)) {
                                                //Shrink the image to 90% of the original
                                                using (System.Drawing.Image newImage = ShrinkImage(oldImage, 0.9f)) {
                                                    //Convert the image to bytes using JPG at 85%
                                                    newBytes = ConvertImageToBytes(newImage, 85);
                                                }
                                            }
                                        }
                                        //Create a new iTextSharp image from our bytes
                                        iTextSharp.text.Image compressedImage = iTextSharp.text.Image.GetInstance(newBytes);
                                        //Kill off the old image
                                        PdfReader.KillIndirect(obj);
                                        //Add our image in its place
                                        stamper.Writer.AddDirectImageSimple(compressedImage, (PRIndirectReference)obj);
                                    }
                                }
                            }
                        }
                    }
                }
            }

            //Release the reader now that the output has been written
            reader.Close();

            this.Close();
        }

        //Standard image save code from MSDN, returns a byte array
        private static byte[] ConvertImageToBytes(System.Drawing.Image image, long compressionLevel) {
            if (compressionLevel < 0) {
                compressionLevel = 0;
            } else if (compressionLevel > 100) {
                compressionLevel = 100;
            }
            ImageCodecInfo jpgEncoder = GetEncoder(ImageFormat.Jpeg);

            System.Drawing.Imaging.Encoder myEncoder = System.Drawing.Imaging.Encoder.Quality;
            EncoderParameters myEncoderParameters = new EncoderParameters(1);
            EncoderParameter myEncoderParameter = new EncoderParameter(myEncoder, compressionLevel);
            myEncoderParameters.Param[0] = myEncoderParameter;
            using (MemoryStream ms = new MemoryStream()) {
                image.Save(ms, jpgEncoder, myEncoderParameters);
                return ms.ToArray();
            }

        }
        //standard code from MSDN
        private static ImageCodecInfo GetEncoder(ImageFormat format) {
            //We are saving, so look the codec up among the encoders
            ImageCodecInfo[] codecs = ImageCodecInfo.GetImageEncoders();
            foreach (ImageCodecInfo codec in codecs) {
                if (codec.FormatID == format.Guid) {
                    return codec;
                }
            }
            return null;
        }
        //Standard high quality thumbnail generation from http://weblogs.asp.net/gunnarpeipman/archive/2009/04/02/resizing-images-without-loss-of-quality.aspx
        private static System.Drawing.Image ShrinkImage(System.Drawing.Image sourceImage, float scaleFactor) {
            int newWidth = Convert.ToInt32(sourceImage.Width * scaleFactor);
            int newHeight = Convert.ToInt32(sourceImage.Height * scaleFactor);

            var thumbnailBitmap = new Bitmap(newWidth, newHeight);
            using (Graphics g = Graphics.FromImage(thumbnailBitmap)) {
                g.CompositingQuality = CompositingQuality.HighQuality;
                g.SmoothingMode = SmoothingMode.HighQuality;
                g.InterpolationMode = InterpolationMode.HighQualityBicubic;
                System.Drawing.Rectangle imageRectangle = new System.Drawing.Rectangle(0, 0, newWidth, newHeight);
                g.DrawImage(sourceImage, imageRectangle);
            }
            return thumbnailBitmap;
        }
    }
}
Endeavor answered 5/1, 2012 at 23:38 Comment(6)
Concerning DPI, it's based on the size and scale of the image against the size of the page. So different parts of the same page could have different DPI. - Ostensorium
Yes, that's effectively what happens. But nowhere in the PDF language can you say "make this image 300 DPI". Instead you say "here's a 500 pixel wide image, scale it by 10%". - Endeavor
@ChrisHaas All I am actually trying to do is get the PDF size down. - Tully
Also, with this method you will get white fragments where transparency was supposed to be. - Tully
So I think the actual question is how to preserve the transparency when re-compressing the memory stream. - Tully
@user1053237, if you want to shrink a PDF with iTextSharp then shrinking the images is the best way and this is how you do it. iTextSharp itself won't touch your images or apply any compression for you. (Okay, not technically correct, it will apply a degree of LOSSLESS compression but that doesn't affect already compressed data too much.) The sample above showed how to work with a JPG, which obviously doesn't support transparency. You should be able to perform the same basic routine with PNGs but you'll need to switch encoders in ConvertImageToBytes. - Endeavor

I don't know about iTextSharp, but you have to rewrite a PDF file if anything is changed, as it contains an xref table (index) with the exact file position of each object. This means if even one byte is added or removed, the PDF becomes corrupted.

Your best bet for recompressing the images is JBIG2 if they are B&W, or JPEG2000 otherwise. The Jasper library will happily encode JPEG2000 codestreams for placement into PDF files at whatever quality you desire.

If it were me, I'd do it all from code without the PDF libraries. Just find all images (anything between stream and endstream after an occurrence of JPXDecode (JPEG2000), JBIG2Decode (JBIG2) or DCTDecode (JPEG)), pull that out, re-encode it with Jasper, then stick it back in again and update the xref table.
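As a rough illustration of pulling the bytes out from between stream and endstream, here is a minimal sketch. All names are hypothetical. It assumes you have already isolated the bytes of a single object, that its /Length entry is a direct integer you have parsed yourself, and that the first occurrence of the word "stream" in the object is the stream keyword. It works on bytes only, since string handling can silently alter binary image data.

```csharp
using System;
using System.Text;

static class StreamSketch {
    // Returns the index of the first occurrence of an ASCII token in data
    // at or after start, or -1 if the token is not present.
    public static int IndexOf(byte[] data, string token, int start) {
        byte[] t = Encoding.ASCII.GetBytes(token);
        for (int i = start; i <= data.Length - t.Length; i++) {
            int j = 0;
            while (j < t.Length && data[i + j] == t[j]) j++;
            if (j == t.Length) return i;
        }
        return -1;
    }

    // Returns the raw stream bytes of one object, given the value of /Length.
    public static byte[] ExtractStream(byte[] objectBytes, int length) {
        int pos = IndexOf(objectBytes, "stream", 0);
        if (pos < 0) throw new ArgumentException("no stream keyword found");
        pos += "stream".Length;
        // The keyword is followed by CRLF or by a single LF; skip whichever is there.
        if (pos < objectBytes.Length && objectBytes[pos] == (byte)'\r') pos++;
        if (pos < objectBytes.Length && objectBytes[pos] == (byte)'\n') pos++;
        byte[] result = new byte[length];
        Array.Copy(objectBytes, pos, result, 0, length);
        return result;
    }
}
```

Using /Length rather than searching for endstream avoids guessing how many end-of-line bytes precede endstream, and avoids false matches inside the image data itself. After re-encoding, the /Length value in the object has to be updated to the new byte count.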

To update the xref table, find the positions of each object (starting 00001 0 obj) and just update the new positions in the xref table. It's not too much work, less than it sounds. You might be able to get all the offsets with a single regular expression (I'm not a C# programmer, but in PHP it would be that simple.)

Then finally update the value of the startxref tag in the trailer with the offset of the beginning of the xref table (where it says xref in the file).
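The offset bookkeeping described above might look roughly like the following sketch. The names are hypothetical, and it assumes a classic xref table (not a cross-reference stream) and that no "N G obj" look-alike appears inside binary stream data, which a real tool would have to guard against.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class XrefSketch {
    // Maps object number to the byte offset of its "N G obj" header.
    public static SortedDictionary<int, long> FindObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string
        // indexes are the same as byte offsets in the file.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        var offsets = new SortedDictionary<int, long>();
        foreach (Match m in Regex.Matches(text, @"(?<![0-9A-Za-z])(\d+) (\d+) obj\b")) {
            // A later definition of the same object number replaces an earlier one.
            offsets[int.Parse(m.Groups[1].Value)] = m.Index;
        }
        return offsets;
    }

    // Formats one in-use xref entry: 10-digit offset, 5-digit generation,
    // the letter n, and a two-byte end of line (20 bytes in total).
    public static string FormatEntry(long offset, int generation) {
        return offset.ToString("D10") + " " + generation.ToString("D5") + " n\r\n";
    }
}
```

With the new offsets in hand, each entry in the table is overwritten with FormatEntry(offset, generation), and the number after startxref is set to the byte offset of the xref keyword, found the same way.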

Otherwise you'll end up decoding the entire PDF and rewriting it all, which will be slow, and you might lose something along the way.

Ostensorium answered 5/1, 2012 at 11:17 Comment(8)
The issue comes in where you actually need to make sure the stream is not corrupt. - Tully
Which stream, the image stream? Why would the stream be corrupt? - Ostensorium
Well, to actually get it properly in the first place, I tried to save the parts between stream and endstream to an image file on the disk and it could not be opened. I am now just writing each line to a new PDF file and then checking if I can get the line that contains the image into a MemoryStream or something. - Tully
It sure can be opened if you save it correctly. There is a newline byte after the stream and before the endstream; you need to account for this and remove it from the variable before saving to file. Don't use a string trim function on it, as that'll trim off some of the bytes that you want and corrupt the image. Then you just need to make sure you know what format it is. JPEG2000 is embedded in codestream format without the full header, so the extension is .jpc; Jasper can process this. JPEGs (DCTDecode) should be normal. - Ostensorium
If you can't open it from disk then keeping it in memory will NOT help. It is the same image. If it's corrupt in one it's corrupt in the other. If the image displays when you view the PDF, then the image can be extracted and saved to file. Your issue is probably with those newline bytes. - Ostensorium
If the object says /Filter /DCTDecode then you extract everything between offsets of stream +7 and endstream -1 and save it to a file with extension .jpg. If it says /Filter /JPXDecode then do the same except with extension .jpc. Then process it and put it back. If you change the format you'll also have to change the /Filter tag in the object, otherwise not. You WILL have to change the /Length tag; I forgot to mention that. - Ostensorium
@Ostensorium - Your comment about Jasper encoding JBIG2 and JPEG2000 images is interesting. So the library can do compression on those two formats? Do you have a link, or is it the one here? - Esparza
Yes, that one, but Jasper is only for JPEG2000 & JPEG formats. For JBIG2 you need jbig2enc & jbig2dec, for encoding and decoding respectively (all open source). - Ostensorium

There is an example on how to find and replace images in an existing PDF by the creator of iText. It's actually a small excerpt from his book. Since it's in Java, here's a simple C# equivalent:

public void ReduceResolution(PdfReader reader, long quality) {
  int n = reader.XrefSize;
  for (int i = 0; i < n; i++) {
    PdfObject obj = reader.GetPdfObject(i);
    if (obj == null || !obj.IsStream()) {continue;}

    PdfDictionary dict = (PdfDictionary)PdfReader.GetPdfObject(obj);
    PdfName subType = (PdfName)PdfReader.GetPdfObject(
      dict.Get(PdfName.SUBTYPE)
    );
    if (!PdfName.IMAGE.Equals(subType)) {continue;}

    PRStream stream = (PRStream )obj;
    try {
      PdfImageObject image = new PdfImageObject(stream);
      PdfName filter = (PdfName) image.Get(PdfName.FILTER);
      if (
        PdfName.JBIG2DECODE.Equals(filter)
        || PdfName.JPXDECODE.Equals(filter)
        || PdfName.CCITTFAXDECODE.Equals(filter)
        || PdfName.FLATEDECODE.Equals(filter)
      ) continue;

      System.Drawing.Image img = image.GetDrawingImage();
      if (img == null) continue;

      var ll = image.GetImageBytesType();
      int width = img.Width;
      int height = img.Height;
      using (System.Drawing.Bitmap dotnetImg =
         new System.Drawing.Bitmap(img))
      {
        // set codec to jpeg type => jpeg index codec is "1"
        System.Drawing.Imaging.ImageCodecInfo codec =
        System.Drawing.Imaging.ImageCodecInfo.GetImageEncoders()[1];
        // set parameters for image quality
        System.Drawing.Imaging.EncoderParameters eParams =
         new System.Drawing.Imaging.EncoderParameters(1);
        eParams.Param[0] =
         new System.Drawing.Imaging.EncoderParameter(
           System.Drawing.Imaging.Encoder.Quality, quality
        );
        using (MemoryStream msImg = new MemoryStream()) {
          dotnetImg.Save(msImg, codec, eParams);
          // Store the JPEG bytes as they are (no extra Flate pass over already compressed data)
          stream.SetData(
           msImg.ToArray(), false, PRStream.BEST_COMPRESSION
          );
          stream.Put(PdfName.TYPE, PdfName.XOBJECT);
          stream.Put(PdfName.SUBTYPE, PdfName.IMAGE);
          // The stream now holds JPEG data, whatever the original filter was
          stream.Put(PdfName.FILTER, PdfName.DCTDECODE);
          stream.Put(PdfName.WIDTH, new PdfNumber(width));
          stream.Put(PdfName.HEIGHT, new PdfNumber(height));
          stream.Put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
          stream.Put(PdfName.COLORSPACE, PdfName.DEVICERGB);
        }
      }
    }
    catch {
      // throw;
      // iText[Sharp] can't handle all image types...
    }
    finally {
// may or may not help      
      reader.RemoveUnusedObjects();
    }
  }
}

You'll notice it's only handling JPEG. The logic is reversed (instead of explicitly handling only DCTDECODE/JPEG) so you can comment out some of the ignored image types and experiment with the PdfImageObject in the code above. In particular, most of the FLATEDECODE images (.bmp, .png, and .gif) are represented as PNG (confirmed in the DecodeImageBytes method of the PdfImageObject source code). As far as I know, .NET's built-in PNG encoder gives you no control over the compression level. There are some references to support this here and here. You can try a stand-alone PNG optimization executable, but you also have to figure out how to set PdfName.BITSPERCOMPONENT and PdfName.COLORSPACE in the PRStream.

For completeness' sake, since your question specifically asks about PDF compression, here's how you compress a PDF with iTextSharp:

PdfStamper stamper = new PdfStamper(
  reader, YOUR-STREAM, PdfWriter.VERSION_1_5
);
stamper.Writer.CompressionLevel = 9;
int total = reader.NumberOfPages + 1;
for (int i = 1; i < total; i++) {
  reader.SetPageContent(i, reader.GetPageContent(i));
}
stamper.SetFullCompression();
stamper.Close();

You might also try and run the PDF through PdfSmartCopy to get the file size down. It removes redundant resources, but like the call to RemoveUnusedObjects() in the finally block, it may or may not help. That will depend on how the PDF was created.

IIRC iText[Sharp] doesn't deal well with JBIG2DECODE, so @Alasdair's suggestion looks good - if you want to take the time learning the Jasper library and using the brute-force approach.

Good luck.

EDIT - 2012-08-17, comment by @Craig:

To save the PDF after compressing the jpegs using the ReduceResolution() method above:

a. Instantiate a PdfReader object:

PdfReader reader = new PdfReader(pdf);

b. Pass the PdfReader to the ReduceResolution() method above.

c. Pass the altered PdfReader to a PdfStamper. Here's one way using a MemoryStream:

// Save the altered PDF. Then you can pass the byte array to a database, etc.
using (MemoryStream ms = new MemoryStream()) {
  using (PdfStamper stamper = new PdfStamper(reader, ms)) {
  }
  return ms.ToArray();
}

Or you can use any other Stream if you don't need to keep the PDF in memory. E.g. use a FileStream and save directly to disk.

Esparza answered 6/1, 2012 at 22:15 Comment(6)
This is nice, but how do you save the PDF once you've compressed all the images? That bit of code is left out <sigh>. - Melvinmelvina
Thanks for the code, I tried to implement it. But I get an error. Posted it as a separate question here: #26256695 - Savell
This worked and compressed my PDF by 90% - Thank you! - Krona
The writing-to-file part is not clear. When I write with the code new PdfStamper(reader, new FileStream(@"C:\outputfile.pdf", FileMode.Create)), nothing changes, meaning the file is still the same. Am I doing something wrong? - Domenic
@Intrigue, can you show the section of your code where you write the result from ReduceResolution to a file? It isn't working for me, or I might be missing something. - Domenic
Super! It did compress the images in the PDF. I just changed this void method to return a PdfReader; basically, I send the reader in and get a reader back, this time with the images compressed. - Coontie

I am not sure if you are considering other libraries, but you can easily recompress existing images using Docotic.Pdf library (Disclaimer: I work for the company).

Here is some sample code:

static void RecompressExistingImages(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
            image.RecompressWithGroup4Fax();

        doc.Save(outputName);
    }
}

There are also RecompressWithFlate, RecompressWithGroup3Fax, RecompressWithJpeg and Uncompress methods.

The library will convert color images to bilevel ones if needed. You can specify deflate compression level, JPEG quality etc.

I would also ask you to think twice before using the approach suggested by @Alasdair. If you are going to deal with PDF files that weren't created by you, then the task is far more complex than it might seem.

To start with, there are a great many images compressed by codecs other than JPXDecode, JBIG2Decode or DCTDecode. And a PDF can also contain inline images.

PDF files saved using newer versions of the standard (1.5 or newer) can contain cross-reference streams. This means that reading and updating such files is more complex than just finding and updating some numbers at the end of the file.

So, please, use a PDF library.

Courtmartial answered 8/1, 2012 at 10:1 Comment(0)

I've written a library to do just that. It will also OCR the PDFs using Tesseract or Cuneiform and create searchable, compressed PDF files. It's a library that uses several open source projects (iTextSharp, jbig2 encoder, AForge, muPDF#) to complete the task. You can check it out here: http://hocrtopdf.codeplex.com/

Ramage answered 5/5, 2012 at 15:55 Comment(0)

A simple way to compress PDF is using gsdll32.dll (Ghostscript) and Cyotek.GhostScript.dll (wrapper):

public static void CompressPDF(string sInFile, string sOutFile, int iResolution)
    {
        string[] arg = new string[]
        {
            "-sDEVICE=pdfwrite",
            "-dNOPAUSE",
            "-dSAFER",
            "-dBATCH",
            "-dCompatibilityLevel=1.5",
            "-dDownsampleColorImages=true",
            "-dDownsampleGrayImages=true",
            "-dDownsampleMonoImages=true",
            "-sPAPERSIZE=a4",
            "-dPDFFitPage",
            "-dDOINTERPOLATE",
            "-dColorImageDownsampleThreshold=1.0",
            "-dGrayImageDownsampleThreshold=1.0",
            "-dMonoImageDownsampleThreshold=1.0",
            "-dColorImageResolution=" + iResolution.ToString(),
            "-dGrayImageResolution=" + iResolution.ToString(),
            "-dMonoImageResolution=" + iResolution.ToString(),
            "-sOutputFile=" + sOutFile,
            sInFile
        };
        using(GhostScriptAPI api = new GhostScriptAPI())
        {
            api.Execute(arg);
        }
    }
Oversee answered 19/8, 2016 at 9:35 Comment(0)
