Using iTextPDF to trim a page's whitespace
I have a PDF which consists of some data, followed by some whitespace. I don't know how large the data is, but I'd like to trim off the whitespace following the data.

    PdfReader reader = new PdfReader(PDFLOCATION);
    Rectangle rect = new Rectangle(700, 2000);
    Document document = new Document(rect);
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(SAVELCATION));

    document.open();

    int n = reader.getNumberOfPages();
    PdfImportedPage page;
    for (int i = 1; i <= n; i++) {
        document.newPage();
        page = writer.getImportedPage(reader, i);
        Image instance = Image.getInstance(page);
        document.add(instance);
    }
    document.close();

Is there a way to clip/trim the whitespace for each page in the new document? This PDF contains vector graphics.

I'm using iTextPDF, but can switch to any Java library (mavenized, Apache license preferred).

Yaws answered 21/11, 2013 at 19:7 Comment(3)
Does your PDF contain vector graphics? Or just text and bitmap images? – Boswall
It contains vector graphics. – Yaws
Hm, OK, that's unfortunate. The reason I asked is that there is a sample class included with iText which determines the bounding box of all text and bitmaps on a page. Unfortunately, it does not yet take vector graphics into account. – Boswall

As no actual solution has been posted, here are some pointers from the accompanying itext-questions mailing list thread:

  1. As you want to merely trim pages, this is not a case of PdfWriter + getImportedPage usage but instead of PdfStamper usage. Your main code using a PdfStamper might look like this:

    PdfReader reader = new PdfReader(resourceStream); 
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("target/test-outputs/test-trimmed-stamper.pdf")); 
    
    // Go through all pages 
    int n = reader.getNumberOfPages(); 
    for (int i = 1; i <= n; i++) 
    { 
        Rectangle pageSize = reader.getPageSize(i); 
        Rectangle rect = getOutputPageSize(pageSize, reader, i); 
    
        PdfDictionary page = reader.getPageN(i); 
        page.put(PdfName.CROPBOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()})); 
        stamper.markUsed(page); 
    } 
    stamper.close(); 
    

    As you can see, I also added another argument, the page number, to your prospective getOutputPageSize method. After all, the amount of white space to trim might differ from page to page.

  2. If the source document did not contain vector graphics, you could simply use the iText parser package classes. There even already is a TextMarginFinder based on them. In this case the getOutputPageSize method (with the additional page parameter) could look like this:

    private Rectangle getOutputPageSize(Rectangle pageSize, PdfReader reader, int page) throws IOException 
    { 
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());
        Rectangle result = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
        System.out.printf("Text/bitmap boundary: %f,%f to %f, %f\n", finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
        return result;
    }
    

    Using this method with your file test.pdf results in:

    [Image: the page trimmed to its text and bitmap content]

    As you can see, the code trims according to the text (and bitmap image) content on the page.

  3. To find the bounding box respecting vector graphics, too, you essentially have to do the same but you have to extend the parser framework used here to inform its listeners (the TextMarginFinder essentially is a listener to drawing events sent from the parser framework) about vector graphics operations, too. This is non-trivial, especially if you don't know PDF syntax by heart yet.

  4. If the PDFs you need to trim are not too generic but can be forced to include some text or bitmap graphics in relevant positions, though, you could use the sample code above (probably with minor changes) anyway.

    E.g. if your PDFs always start with text on top and end with text at the bottom, you could change getOutputPageSize to create the result rectangle like this:

    Rectangle result = new Rectangle(pageSize.getLeft(), finder.getLly(), pageSize.getRight(), finder.getUry());
    

    This only trims top and bottom empty space:

    [Image: the page with only the top and bottom empty space trimmed]

    Depending on your input data pool and requirements this might suffice.

    Or you can use some other heuristics depending on your knowledge of the input data. If you know something about the positioning of text (e.g. that the heading is always centered and some other text always starts at the left), you can easily extend the TextMarginFinder to take advantage of this knowledge.
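
The top-and-bottom heuristic from item 4 can be tried out without iText. The following is only a minimal sketch using plain java.awt.geom rectangles: the class TrimHeuristics, its two methods and the margin parameter are hypothetical names introduced for illustration and are not part of iText. The content rectangle stands in for the values a TextMarginFinder would report, and the resulting array uses the [llx, lly, urx, ury] order that the CropBox entry in item 1 expects.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper, not part of iText: combines a page box with a content box. */
final class TrimHeuristics {

    /**
     * Keeps the full page width but only the vertical extent of the content.
     * The margin is an assumption added for illustration; pass 0 to trim tightly.
     * Rectangles use PDF-style coordinates: (x, y) is the lower left corner.
     */
    static Rectangle2D.Float trimTopAndBottom(Rectangle2D.Float page,
                                              Rectangle2D.Float content,
                                              float margin) {
        // Never grow beyond the original page while adding the margin.
        float bottom = Math.max(page.y, content.y - margin);
        float top = Math.min(page.y + page.height, content.y + content.height + margin);
        return new Rectangle2D.Float(page.x, bottom, page.width, top - bottom);
    }

    /** Converts a rectangle to the [llx, lly, urx, ury] order used by PDF box entries. */
    static float[] toPdfBoxArray(Rectangle2D.Float r) {
        return new float[] { r.x, r.y, r.x + r.width, r.y + r.height };
    }
}
```

With iText, the content rectangle would be built from finder.getLlx(), finder.getLly(), finder.getUrx() and finder.getUry(), and the array would be wrapped in a PdfArray as in item 1.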


Recent (April 2015, iText 5.5.6-SNAPSHOT) improvements

The current development version, 5.5.6-SNAPSHOT, extends the parser package to also include vector graphics parsing. This allows for an extension of iText's original TextMarginFinder class implementing the new ExtRenderListener methods like this:

@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
    List<Vector> points = new ArrayList<Vector>();
    if (renderInfo.getOperation() == PathConstructionRenderInfo.RECT)
    {
        float x = renderInfo.getSegmentData().get(0);
        float y = renderInfo.getSegmentData().get(1);
        float w = renderInfo.getSegmentData().get(2);
        float h = renderInfo.getSegmentData().get(3);
        points.add(new Vector(x, y, 1));
        points.add(new Vector(x+w, y, 1));
        points.add(new Vector(x, y+h, 1));
        points.add(new Vector(x+w, y+h, 1));
    }
    else if (renderInfo.getSegmentData() != null)
    {
        for (int i = 0; i < renderInfo.getSegmentData().size()-1; i+=2)
        {
            points.add(new Vector(renderInfo.getSegmentData().get(i), renderInfo.getSegmentData().get(i+1), 1));
        }
    }

    for (Vector point: points)
    {
        point = point.cross(renderInfo.getCtm());
        Rectangle2D.Float pointRectangle = new Rectangle2D.Float(point.get(Vector.I1), point.get(Vector.I2), 0, 0);
        if (currentPathRectangle == null)
            currentPathRectangle = pointRectangle;
        else
            currentPathRectangle.add(pointRectangle);
    }
}

@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
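    if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
    {
        // currentPathRectangle is null if no path points were collected;
        // adding null to an existing rectangle would throw a NullPointerException
        if (textRectangle == null)
            textRectangle = currentPathRectangle;
        else if (currentPathRectangle != null)
            textRectangle.add(currentPathRectangle);
    }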
    currentPathRectangle = null;

    return null;
}

@Override
public void clipPath(int rule)
{
}

(Full source: MarginFinder.java)

Using this class to trim the white space results in

[Image: the page trimmed to its content, vector graphics included]

which is pretty much what one would hope for.

Beware: The implementation above is far from optimal. It is not even correct, as it includes all curve control points, which makes the box larger than necessary. Furthermore, it ignores details such as line width and line cap and join styles. It is merely a proof of concept.
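
To illustrate the control point issue: a cubic Bézier curve always stays inside the box of its four points, but it usually does not reach the two inner control points. A tighter box can be computed from the end points plus the positions where the derivative of each coordinate is zero. The following is only a sketch in plain Java; the class BezierBounds and its methods are hypothetical and not part of iText or the MarginFinder. It assumes the points have already been transformed by the current transformation matrix, as in modifyPath above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of iText: tight bounds of one cubic Bezier segment. */
final class BezierBounds {

    /** Returns {minX, minY, maxX, maxY} for the curve p0..p3, each point given as {x, y}. */
    static double[] tightBounds(double[] p0, double[] p1, double[] p2, double[] p3) {
        double[] x = range(p0[0], p1[0], p2[0], p3[0]);
        double[] y = range(p0[1], p1[1], p2[1], p3[1]);
        return new double[] { x[0], y[0], x[1], y[1] };
    }

    /** Min and max of one coordinate: the end points plus any interior extrema. */
    private static double[] range(double p0, double p1, double p2, double p3) {
        double min = Math.min(p0, p3);
        double max = Math.max(p0, p3);
        for (double t : extremaParameters(p0, p1, p2, p3)) {
            double v = evaluate(p0, p1, p2, p3, t);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[] { min, max };
    }

    /** Roots of the derivative inside (0, 1); the derivative is 3 * (a*t^2 + b*t + c). */
    private static List<Double> extremaParameters(double p0, double p1, double p2, double p3) {
        double d1 = p1 - p0, d2 = p2 - p1, d3 = p3 - p2;
        double a = d1 - 2 * d2 + d3;
        double b = 2 * (d2 - d1);
        double c = d1;
        List<Double> roots = new ArrayList<>();
        if (Math.abs(a) < 1e-12) {
            // Degenerates to a linear equation b*t + c = 0.
            if (Math.abs(b) > 1e-12) addIfInside(roots, -c / b);
        } else {
            double disc = b * b - 4 * a * c;
            if (disc >= 0) {
                double s = Math.sqrt(disc);
                addIfInside(roots, (-b + s) / (2 * a));
                addIfInside(roots, (-b - s) / (2 * a));
            }
        }
        return roots;
    }

    private static void addIfInside(List<Double> roots, double t) {
        if (t > 0 && t < 1) roots.add(t);
    }

    /** Evaluates the cubic Bernstein polynomial at parameter t. */
    private static double evaluate(double p0, double p1, double p2, double p3, double t) {
        double u = 1 - t;
        return u * u * u * p0 + 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t * p3;
    }
}
```

For the curve from (0, 0) to (10, 0) with control points (0, 10) and (10, 10), the control point box reaches up to y = 10, while the curve itself only reaches y = 7.5. Line width and line cap and join styles would still have to be accounted for separately.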

All test code is in TestTrimPdfPage.java.

Boswall answered 26/11, 2013 at 8:47 Comment(5)
Thanks. I ended up just adding a string at the lower right edge of the graphic (your 4th recommendation), which allowed the TextMarginFinder to find the margin around the graphic element. – Yaws
While working on this answer, a bug in MarginFinder became apparent: a NullPointerException situation in renderPath. It is fixed now. – Boswall
Is there any analog in iText 7? iText 7 doesn't have the IExtRenderListener interface. – Unvoiced
Yes, there is; iText 7 has generalized the event listener mechanism a bit. You don't need to implement a different interface, you merely have to register for additional event types. – Boswall
@Unvoiced You can find a port of the MarginFinder in this answer and here on GitHub. – Boswall
