Generate big PDF from huge amount of data

Asked 25/6, 2014 at 16:38 Answered 3/7, 2014 at 22:45

I read data from a database and generate an HTML DOM from it. The data volume is too large to fit in memory at once, but it can be provided chunk by chunk.

I would like to transform resulting HTML into PDF using Flying Saucer:

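import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.io.IOUtils;
import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.Element;
import org.dom4j.io.DOMWriter;
import org.xhtmlrenderer.pdf.ITextRenderer;
// The unqualified DocumentException caught below is iText's checked exception;
// its package depends on the iText version bundled with Flying Saucer.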

OutputStream bodyStream = outputMessage.getBody();

ITextRenderer renderer = new ITextRenderer();

DocumentFactory documentFactory = DocumentFactory.getInstance();
DOMWriter domWriter = new DOMWriter();

Element htmlNode = documentFactory.createElement("html");
Document htmlDocument = documentFactory.createDocument(htmlNode);

int currentLine = 1;
int currentPage = 1;

try {
    while (currentLine <= numberOfLines) {
        currentLine += loadDataToDOM(documentFactory, htmlNode, currentLine, CHUNK_SIZE);

        renderer.setDocument(domWriter.write(htmlDocument), null);
        renderer.layout();

        if (currentPage == 1) {
            // For the first page the PDF writer is created:
            renderer.createPDF(bodyStream, false);
        }
        else {
            // Other documents are appended to current PDF writer:
            renderer.writeNextDocument(currentPage);
        }

        currentPage += renderer.getRootBox().getLayer().getPages().size();
    }

    // Finalise the PDF:
    renderer.finishPDF();
}
catch (DocumentException e) {
    throw new IOException(e);
}
catch (org.dom4j.DocumentException e) {
    throw new IOException(e);
}
finally {
    IOUtils.closeQuietly(bodyStream);
}

The problem with this approach is that the last page of each chunk is not necessarily filled completely. Is there a way to fill that space? For example, I could check whether the last page is only partially filled, discard it (not write it to the PDF), work out which data was rendered on that page, and rewind the position in the database (currentLine in the example). It would be nice if someone could post a complete solution.
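
A minimal sketch of the rewind bookkeeping described above, independent of Flying Saucer. It rests on a strong assumption that does not come from the original code: every line has the same height, so a page holds a fixed number of lines. ChunkRewind, plan and linesPerPage are hypothetical names. With real HTML layout, the number of lines on the last page would have to be obtained from the renderer instead.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkRewind {
    /**
     * Returns {firstLine, lineCount} for every chunk that is actually written.
     * Assumes each page holds exactly linesPerPage lines (hypothetical simplification).
     */
    static List<int[]> plan(int numberOfLines, int chunkSize, int linesPerPage) {
        if (linesPerPage <= 0 || chunkSize < linesPerPage) {
            throw new IllegalArgumentException("a chunk must fill at least one page");
        }
        List<int[]> written = new ArrayList<>();
        int currentLine = 1; // 1-based, as in the question
        while (currentLine <= numberOfLines) {
            int loaded = Math.min(chunkSize, numberOfLines - currentLine + 1);
            boolean lastChunk = currentLine + loaded > numberOfLines;
            // Keep only whole pages, except for the final chunk, whose last
            // page is allowed to stay partially filled.
            int kept = lastChunk ? loaded : (loaded / linesPerPage) * linesPerPage;
            written.add(new int[] {currentLine, kept});
            // "Rewind": the cursor advances only past the lines that were kept,
            // so the discarded lines are loaded again with the next chunk.
            currentLine += kept;
        }
        return written;
    }
}
```

For 25 lines, a chunk size of 10 and 4 lines per page, the plan writes lines 1-8, 9-16 and 17-25; only the very last page is partially filled.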

Subcontraoctave answered 25/6, 2014 at 16:38 Comment(3)

Bad idea. First you create the HTML which takes plenty of space, then you use that HTML to create PDF. If memory matters, you should create the PDF straight from the data without first creating the HTML. – Cochleate 25/6, 2014 at 16:45

Yes, but how much code will I need to write to render the HTML using iText low-level primitives (moveTo(), lineTo(), beginText())? Right now I have 50 lines of code that are easy to manage. HTML and CSS are familiar to everyone, and changing the layout or colors is no problem. Bruno, I have looked briefly at your book "iText in Action" (many thanks for it!) and the headers/footers magic on page 430 (chapter 14) is already scary. I would happily use com.itextpdf.tool.xml.pipeline.html.HtmlPipeline, but it does not support basic CSS selectors, let alone floating boxes. – Subcontraoctave 26/6, 2014 at 11:8

Why would you use low-level primitives? I'll give you some pointers to easy examples in an answer. – Cochleate 26/6, 2014 at 13:51

As I already mentioned in the comments, you are wasting memory and processing time by creating a PDF from a data source by creating HTML first and then converting the HTML to PDF. You're also introducing plenty of unnecessary complexity.

In your comment, you mention low-level functionality such as moveTo() and lineTo(). It would indeed be madness to draw a table using low-level operations that draw every single line and every single word.

You should use the PdfPTable class. The ArrayToTable example is a very simple POC where the data comes in the form of a List<List<String>>. The code is as simple as this:

PdfPTable table = new PdfPTable(8);
table.setWidthPercentage(100);
List<List<String>> dataset = getData();
for (List<String> record : dataset) {
    for (String field : record) {
        table.addCell(field);
    }
}
document.add(table);

Of course: you are talking about a huge data set, in which case, you may not want to build up the table in memory first and then flush the memory when the table is added to the document. You'll want to add small parts of the table while you are building it. That's what happens in the MemoryTests example. Add this line:

table.setComplete(false);

And you can add the table little by little (in the example: every 10 rows). When you've finished adding cells to the table, you should do this:

table.setComplete(true);
document.add(table);

This will add the final rows.
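
The "little by little" pattern can be sketched without any PDF library. The helper below is a hypothetical stand-in, not iText API: BatchedRows and forEachBatch are illustrative names, and the flush callback marks the place where, in the iText version, the cells of the batch would be added to the table and document.add(table) would be called.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class BatchedRows {
    /**
     * Pulls rows from the iterator one at a time and hands them to flush in
     * batches of at most batchSize, so only one batch is held in memory.
     * Returns the number of times flush was called.
     */
    static <T> int forEachBatch(Iterator<T> rows, int batchSize, Consumer<List<T>> flush) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int flushes = 0;
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                flush.accept(new ArrayList<>(batch)); // hand over a copy
                batch.clear();                        // release the rows just flushed
                flushes++;
            }
        }
        if (!batch.isEmpty()) {
            // Final partial batch: corresponds to the rows added after setComplete(true).
            flush.accept(new ArrayList<>(batch));
            flushes++;
        }
        return flushes;
    }
}
```

The iterator would typically wrap a database cursor, so rows are fetched only as fast as they are flushed.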

If you want a table with a repeating header and/or footer, take a look at the tables in this PDF: header_footer_1.pdf

The HeaderFooter1 and HeaderFooter2 examples will show you how it's done.

Cochleate answered 26/6, 2014 at 14:1 Comment(2)

Thanks for the detailed answer, I appreciate it. In principle I have "tabled data" (here is an example with each cell having a border while here is a non-draft version). Each cell may in turn contain other text boxes with a background. If I understand correctly, I need to represent each piece with a com.itextpdf.text.Chunk object and then combine them into a com.itextpdf.text.Phrase? – Subcontraoctave 30/6, 2014 at 16:0

The colored backgrounds for arbitrary pieces of text are indeed something you can achieve either with Chunk.setBackground() or with generic tag functionality (for instance, if the background isn't a rectangle). Looking at the desired output, I wouldn't use PdfPTable. Instead I'd use a ColumnText object and Chunk.TABBING for the tabs separating the <xyz> numbers and the actual data. – Cochleate 30/6, 2014 at 16:6

This is not an answer to the precise question you asked, so if this post is useless I'll delete it.

Since the document is huge, you may well get the best results by emitting the data as LaTeX and then running it through pdflatex.

Advantages:

LaTeX source of the kind you need is simple to emit - no more complicated than HTML.
The whole TeX system is designed to produce beautiful and huge documents. LaTeX is processed as a stream of pages. The number of pages has essentially no effect on RAM resources required.
You get the full power of a typesetting language to make your pages look great. Want fancy headers? Nicely positioned page numbers? Section headings? Clickable Table of Contents, etc. etc. No problem.
LaTeX is available free for all major operating systems.

Disadvantages:

LaTeX is a native executable, not a Java lib.

If you are interested in this, I can flesh out more details.
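
As a rough illustration of how simple the emitting side can be, here is a minimal Java sketch that streams rows into a LaTeX longtable. LatexEmitter and its method names are hypothetical, the choice of the longtable package and left-aligned columns is an assumption, and running pdflatex on the result is left out.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

final class LatexEmitter {
    /** Escapes the characters LaTeX treats specially in ordinary text. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\textbackslash{}"); break;
                case '~':  out.append("\\textasciitilde{}"); break;
                case '^':  out.append("\\textasciicircum{}"); break;
                case '&': case '%': case '$': case '#': case '_': case '{': case '}':
                    out.append('\\').append(c); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** One table row: escaped fields joined by " & " and terminated by "\\". */
    static String row(List<String> fields) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) out.append(" & ");
            out.append(escape(fields.get(i)));
        }
        return out.append(" \\\\\n").toString();
    }

    /** Streams a complete document; only one row is held in memory at a time. */
    static void writeDocument(Writer w, Iterator<List<String>> rows, int columns)
            throws IOException {
        StringBuilder spec = new StringBuilder();
        for (int i = 0; i < columns; i++) spec.append('l'); // left-aligned columns
        w.write("\\documentclass{article}\n\\usepackage{longtable}\n\\begin{document}\n");
        w.write("\\begin{longtable}{" + spec + "}\n");
        while (rows.hasNext()) {
            w.write(row(rows.next()));
        }
        w.write("\\end{longtable}\n\\end{document}\n");
    }
}
```

The Writer would normally be a buffered file writer, with the row iterator backed by the database cursor.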

Neonatal answered 3/7, 2014 at 22:45 Comment(1)

I am aware of LaTeX. There are two more disadvantages: (1) Processing time. Calling an external utility is expensive in terms of time. Moreover, LaTeX has a big ecosystem, which takes time to load. (2) Adding yet another technology to the project makes it more difficult to maintain. HTML is more or less familiar to everybody, but instructions like \rfoot{Page \thepage} take some effort to explore. I would suppose that \textbf{\thepage} will work fine inside a header/footer definition, but more exotic styling like creating a colored box is already beyond my understanding of what is "simple". – Subcontraoctave 26/8, 2014 at 10:6
