Converting HTML files to PDF [closed]
D

8

141

I need to automatically generate a PDF file from an existing (X)HTML document. The input files (reports) use a rather simple, table-based layout, so support for really fancy JavaScript/CSS stuff is probably not needed.

As I am used to working in Java, a solution that can easily be used in a Java project is preferable. It only needs to work on Windows systems, though.

One way to do it that is feasible, but does not produce good-quality output (at least out of the box), is to use CSS2XSLFO and Apache FOP to create the PDF files. The problem I encountered was that while CSS attributes are converted nicely, the table layout is badly broken, with text flowing out of the table cells.

I also took a quick look at Jrex, a Java-API for using the Gecko rendering engine.

Is there maybe a way to grab the rendered page from the Internet Explorer rendering engine and send it to a PDF printer tool automatically? I have no experience with OLE programming on Windows, so I have no clue what's possible and what is not.

Do you have an idea?

Distil answered 11/3, 2009 at 8:57 Comment(8)
I've recently created a Java library, docbag, that can convert XHTML to PDF documents. The current version is not very advanced, but if your XHTML templates are simple this library may come in handy.Brownedoff
I think the way to go is to use the browsers capabilities to do the translation. See https://mcmap.net/q/45949/-how-to-use-the-browser-39-s-chrome-firefox-html-css-js-rendering-engine-to-produce-pdf/39998Chondro
I am stuck with generating pdf from a html that contains Cyrillic letters. Everything's fine except Cyrillic letters which are omitted. Anyone who got this kinda problem?Boding
@krisiliev: I had similar issues, and as far as I can remember, the font used was very important. Most fonts do not support complete UTF8 characters, but the following should: ' font-family: Arial Unicode MS;' (CSS). Also make sure to use the correct encoding (I would advise to always use UTF-8)Distil
this linked helped me hmkcode.com/itext-html-to-pdf-using-javaSuperlative
This question is off-topic at SO, but on-topic in softwarerecs.SE. See How can I convert HTML with CSS to PDF?.Kato
@Jakub Torbicki you posted a broken link ,it does not work for me !Unblushing
What would the answer be today, in 2020? I suggest using print CSS and then a modern HTML2PDF engine to produce the binary PDF output to be sent to the client's browser.Metalinguistics
G
80

The Flying Saucer XHTML renderer project has support for outputting XHTML to PDF. Have a look at an example here.

Godevil answered 11/3, 2009 at 9:22 Comment(3)
The real problem with Flying Saucer is that it uses iText to render PDF, which is an AGPL v3 licensed libChondro
The version of itext used by Flying Saucer is 2.0.8 which was available under LGPL. Only version numbers 5 or above are on the more restrictive license. #2692500Dominy
I'd say the real problem with Flying Saucer is that it requires a well-formed and valid XML document. It's easy to unwittingly break the PDF rendering by including something like an ampersand in your HTML, or some javascript code that makes your rendered HTML not strict XHTML. Though this can be mitigated with automated tests or some process that involves XML validation.Circulate
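The well-formedness concern raised in the last comment can be checked before rendering. The following is only a minimal sketch using the JDK's built-in XML parser; the class name `WellFormedCheck` and the method `isWellFormed` are made up for illustration and are not part of Flying Saucer. It checks well-formedness only, not validity against the XHTML DTD, and a document carrying a DOCTYPE may cause the parser to try to fetch the DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: rejects markup that is not well-formed XML
// before it is handed to an XHTML-to-PDF renderer.
final class WellFormedCheck {

    static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (SAXException e) {
            // Raised for problems such as a bare ampersand or an unclosed tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }
}
```

A check like this could run in an automated test over the report templates, so that a stray `&` is caught before PDF generation fails.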
O
54

Did you try WKHTMLTOPDF?

It's a simple command-line utility built on WebKit, an open source rendering engine. Both are free.

We've set a small tutorial here
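Since the question asks for something usable from Java, here is a rough sketch of calling the tool as an external process. It assumes a `wkhtmltopdf` binary is on the PATH and relies only on the basic `wkhtmltopdf <input> <output>` invocation; the class and method names are invented for this example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the wkhtmltopdf command-line tool.
final class WkhtmltopdfRunner {

    // Builds the basic "<executable> <input> <output>" command line.
    static List<String> command(String executable, Path html, Path pdf) {
        return List.of(executable, html.toString(), pdf.toString());
    }

    // Runs the conversion and returns the process exit code.
    // Assumes "wkhtmltopdf" can be found on the PATH.
    static int convert(Path html, Path pdf)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command("wkhtmltopdf", html, pdf))
                .inheritIO() // forward the tool's console output
                .start();
        return process.waitFor();
    }
}
```

Each call starts a separate operating-system process, so a caller would typically check the returned exit code before using the output file.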

EDIT (2017):

If I were to build something today, I wouldn't go that route anymore.
I would use http://pdfkit.org/ instead,
probably stripping it of all its Node.js dependencies so that it runs in the browser.

Otherness answered 31/8, 2009 at 20:45 Comment(19)
For a straight html-page-to-pdf conversion, this is better than anything else I've seen, free or commercial.Timberlake
Does it work on a non Mac OS?Eckardt
@Eran, we use it on linux. I think there's a windows version tooOtherness
@Otherness Yes, there is a Windows version too.Advertise
tested on windows XP (version 0.9.9) and works very well. Also, does not require admin rights on the machine to install.Episcopal
Why can't we use the real browser for that instead of a fork of the (now unmaintained) rendering engine? See https://mcmap.net/q/45949/-how-to-use-the-browser-39-s-chrome-firefox-html-css-js-rendering-engine-to-produce-pdf/39998Chondro
@DavidHofmann, probably because this question dates back to 2009. From the last check I did few months ago, there was still no comparable solution in JSOtherness
How would this work in a threaded Enterprise environment that would be generating several hundred pdf files a minute?Bounds
@IcedDante, what makes you think there would be a problem?Otherness
I guess what I am wondering is if this shell utility creates its own memory space for each invocation or if it operates like a utility in headless mode where each thread would be using a shared resourceBounds
@IcedDante, we have a similar load of pdf as yours, but we queue them in a background job, to preserve server performances. And run them one by one. However if I remember well, in the beginning we made some tests, and there was no collision on concurrent calls.Otherness
i love you for this reference. great utilityWolfe
It's JavaScript, not Java....Offense
@CardinalSystem it's neither JS nor Java, just a command line tool over the library of WKHTMLTOPDF written in cOtherness
For many simple cases , I still do recommend using a wkhtmltopdf binaryContrabandist
Can confirm wkhtmltopdf is a great tool, and easy to use. I've been using it for years and still use it frequently.Cryoscope
From Java, you can use github.com/wooio/htmltopdf-java which is a wrapper around wkhtmltopdfBuddle
@Danielany may I ask, if you have any experience using it in a web server environment? I mean I think, it won't play nicely with a web server spawning new process for each client request.Presentable
@ayanahmedov, yes we do that for about 13 years now, on an Ubuntu server with nginxOtherness
I
48

Check out iText; it is a pure Java PDF toolkit which has support for reading data from HTML. I used it recently in a project when I needed to pull content from our CMS and export as PDF files, and it was all rather straightforward. The support for CSS and style tags is pretty limited, but it does render tables without any problems (I never managed to set column width though).

Creating a PDF from HTML goes something like this:

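// Assumes iText 2.x (com.lowagie packages); `out` is an OutputStream, `html` a String.
import java.io.StringReader;
import com.lowagie.text.Document;
import com.lowagie.text.PageSize;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.pdf.PdfWriter;
Document doc = new Document(PageSize.A4);
PdfWriter.getInstance(doc, out);        // bind the document to the output stream
doc.open();
HTMLWorker hw = new HTMLWorker(doc);    // simple HTML parser, very limited CSS support
hw.parse(new StringReader(html));
doc.close();                            // flushes the PDF to `out`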
Illampu answered 11/3, 2009 at 9:32 Comment(5)
It's AGPL, seems even worse than GPL, you need to be open source even if you just serve the PDF and iText is server side.Eckardt
@Eran, Just use the last non-AGPL version (com.lowagie:itext:2.1.7 in Maven).Feune
HTMLWorker is deprecated in newer versions of IText in favor of XMLWorker; however CSS support is poor in both cases (see demo.itextsupport.com/xmlworker/itextdoc/…) and was not adequate for my needs. On the contrary Flying Saucer was perfect.Whitehorse
You may use LGPL version which could be found at github.com/albfernandez/itext2Iosep
HTMLWorker supports very simple HTML documents, with basic elements and no CSS. It is too limited to be useful. But the more recent iText html2pdf works really great kb.itextpdf.com/home/it7kb/ebooks/…Hern
A
4

If you have the funding, nothing beats Prince XML, as this video shows.

Arthro answered 11/3, 2009 at 9:17 Comment(2)
If you're looking for a cheaper alternative for Prince, try DocRaptor.com. It uses Prince as the engine.Stoppage
And if you want something cheaper, but with more options, try htm2pdf.co.uk - it uses WebKit and uses real WYSIWYGBreastsummer
Y
4

Is there maybe a way to grab the rendered page from the Internet Explorer rendering engine and send it to a PDF printer tool automatically?

This is how ActivePDF works, which is good: it means that you know what you'll get, and it actually has reasonable styling support.

It is also one of the few packages I found (when looking a few years back) that actually supports the various page-break CSS commands.


Unfortunately, the ActivePDF software is very frustrating - since it has to launch the IE browser in the background for conversions it can be quite slow, and it is not particularly stable either.

There is a new version currently in Beta which is supposed to be much better, but I've not actually had a chance to try it out, so don't know how much of an improvement it is.

Yogini answered 11/3, 2009 at 9:47 Comment(2)
Thanks for the helpful answer. I don't think ActivePDF is really suitable because of the price, but it's good to know something like that exists.Distil
GrabzIt's HTML to PDF API (grabz.it/html-to-pdf-image-api.aspx) works in the same way: it renders the HTML in a browser and then creates the PDF, which makes for much more accurate PDF conversions.Benis
T
2

You can use a headless Firefox with an extension. It's pretty annoying to get running, but it does produce good results.

Check out this answer for more info.

Turro answered 11/3, 2009 at 9:22 Comment(2)
Doesn't sound like a very scalable solution if one needs to convert pages to PDF on the fly in parallel. If a few requests come through that result in a conversion using FF, your server will have lost a few GB of memory just to serve a few converted pages. This would open your server to a DoS.Sepsis
Better but similar: github.com/ariya/phantomjs/wiki/Screen-Capture (according to we-love-php.blogspot.com/2012/12/… the pdf has real text, not rasterized)Pendulum
S
0

If you look at the side bar of your question, you will see many related questions...

In your context, the simpler method might be to install a PDF print driver like PDFCreator and just print the page to this output.

Shelashelagh answered 11/3, 2009 at 9:34 Comment(2)
How is this a Java solution? This is a windows print driver.Perishable
The OP explicitly mentioned Windows. And I suppose there are similar drivers for other systems. The OP only mentioned Java as a possible solution...Shelashelagh
S
0

Amyuni WebkitPDF could be used with JNI for a Windows-only solution. This is an HTML to PDF/XAML conversion library, free for commercial and non-commercial use.

If the output files are not needed immediately, for better scalability it may be better to have a queue and a few background processes taking items from there, converting them and storing them in the database or on the file system.
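The queue idea could be sketched roughly as follows, using a fixed thread pool from the JDK as the queue plus workers. This is an illustration only: `ConversionQueue` is an invented name, and the `converter` function stands in for whatever HTML to PDF library is actually used (it is not the Amyuni API).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical background queue: jobs wait in the executor's internal
// queue and are processed by a small, fixed number of worker threads.
final class ConversionQueue implements AutoCloseable {

    private final ExecutorService workers;
    // Placeholder for the real converter: takes HTML, returns PDF bytes.
    private final Function<String, byte[]> converter;

    ConversionQueue(int workerCount, Function<String, byte[]> converter) {
        this.workers = Executors.newFixedThreadPool(workerCount);
        this.converter = converter;
    }

    // Enqueues one document; the Future completes once a worker has converted it.
    Future<byte[]> submit(String html) {
        Callable<byte[]> job = () -> converter.apply(html);
        return workers.submit(job);
    }

    // Stops accepting new jobs; jobs already queued still run.
    @Override
    public void close() {
        workers.shutdown();
    }
}
```

Keeping the worker count small caps how many conversions run at once, which is the point of queueing; the caller can store the bytes from the `Future` in the database or on the file system when they arrive.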

usual disclaimer applies

Sacrificial answered 26/9, 2012 at 18:13 Comment(0)
