Performance of wkhtmltopdf

5

30

We intend to use wkhtmltopdf to convert HTML to PDF, but we are concerned about its scalability. Does anyone have any idea how it scales? Our web app could potentially attempt to convert hundreds of thousands of (relatively complex) HTML files, so it's important for us to have some idea. Has anyone got any information on this?

Spar answered 24/7, 2012 at 4:20 Comment(2)
Rounded CSS corners were causing my renders to take 20X longer. Removing these dropped my rendering from ~6sec to ~0.3sec on a relatively simple HTML page, e.g. border-radius: 8px; and border-top-left-radius: 6px; - Gustav
In my case, after a lot of investigation, I found that a URL fetching a QR code from a third party was holding up the creation of my tickets. - Planetoid
26

First of all, your question is quite general; there are many variables to consider when asking about the scalability of any project. Obviously there is a difference between converting "hundreds of thousands" of HTML files over a week and expecting to do that in a day, or an hour. On top of that, "relatively complex" HTML can mean different things to different people.

That being said, since I have done something similar, converting approximately 450,000 HTML files with wkhtmltopdf, I figured I'd share my experience.

Here was my scenario:

  • 450,000 HTML files
    • 95% of the files were one page in length
    • generally containing 2 images (relative path, local system)
    • tabular data (sometimes contained nested tables)
    • simple markup elsewhere (strong, italic, underline, etc)
  • A spare desktop PC
    • 8GB RAM
    • 2.4GHz Dual Core Processor
    • 7200RPM HD

I used a simple single-threaded script written in PHP to iterate over the folders and pass each HTML file path to wkhtmltopdf. The process took about 2.5 days to convert all the files, with very few errors.

I hope this gives you insight into what you can expect from using wkhtmltopdf in your web application. Some obvious improvements would come from running this on better hardware, but mainly from using a multi-threaded application to process files simultaneously.
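
As a rough illustration of the "process files simultaneously" idea, here is a minimal Python sketch (not the PHP script I actually used). It assumes a wkhtmltopdf binary on the PATH; the function names, the folder layout and the default of 4 workers are made up for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(html_path: Path, out_dir: Path) -> list[str]:
    """Return the wkhtmltopdf command line for one input file."""
    # Assumption: file stems are unique, otherwise outputs overwrite each other.
    pdf_path = out_dir / (html_path.stem + ".pdf")
    # --quiet suppresses progress output so the logs stay readable.
    return ["wkhtmltopdf", "--quiet", str(html_path), str(pdf_path)]


def convert_one(html_path: Path, out_dir: Path) -> tuple[Path, bool]:
    """Run one conversion in its own process and report whether it succeeded."""
    result = subprocess.run(build_command(html_path, out_dir), capture_output=True)
    return html_path, result.returncode == 0


def convert_all(src_dir: Path, out_dir: Path, workers: int = 4) -> list[Path]:
    """Convert every *.html under src_dir; return the inputs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(src_dir.rglob("*.html"))
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda path: convert_one(path, out_dir), files)
        return [path for path, ok in results if not ok]
```

Calling convert_all(Path("html"), Path("pdf")) would run up to four conversions at a time. The right worker count depends on your cores and memory, so measure before settling on one.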

Mohican answered 25/7, 2012 at 1:59 Comment(3)
FYI for anyone who doesn't like to do math, that averages to 480ms per doc - Wheen
Or 2 pages per second. - Scalariform
In my experience, on a hyper-threaded quad-core CPU I was able to generate ~4,000 simple invoices in ~30 minutes, but only by splitting the invoices into 4 batches and throwing them at the web server at the same time. The 4 requests then get processed concurrently. Increasing the number of requests beyond that would risk crashing the web server for me. - Logographic
11

In my experience, performance depends a lot on your pictures. If there are lots of large pictures, it can slow down significantly. If at all possible, I would try to stage a test with an estimate of what the load would be for your servers. Some people do use it for intensive operations, but I have never heard of hundreds of thousands. I guess, like everything, it depends on your content and resources.

The following quote is straight off the wkhtmltopdf mailing list:

I'm using wkHtmlToPDF to convert about 6000 E-mails a day to PDF. It's all done on a quadcore server with 4GB memory... it's even more then enough for that.

There are a few performance tips, but I would suggest finding out what your bottlenecks are before optimizing for performance. For instance, I remember someone saying that, if possible, loading images directly from disk instead of having a web server in between can speed it up considerably.
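
To make the "images from disk" tip concrete, here is a minimal Python sketch that rewrites img URLs pointing at your own web server into file:// URIs before the HTML is handed to wkhtmltopdf. The function name, the URL prefix and the local directory are hypothetical, and the regex only handles double-quoted src attributes. As far as I know, recent builds (0.12.6) block local file access unless you pass --enable-local-file-access, so check that for your version.

```python
import re
from pathlib import Path


def localize_images(html: str, url_prefix: str, local_root: Path) -> str:
    """Rewrite <img src="url_prefix..."> to file:// URIs under local_root.

    Assumption: local_root is an absolute path (as_uri() requires one).
    """
    pattern = re.compile(
        r'(<img\b[^>]*?\bsrc=")' + re.escape(url_prefix) + r'([^"]*)"'
    )

    def to_file_uri(match: re.Match) -> str:
        # Strip a leading slash so the join stays inside local_root.
        relative = match.group(2).lstrip("/")
        return f'{match.group(1)}{(local_root / relative).as_uri()}"'

    return pattern.sub(to_file_uri, html)
```

Images served from other hosts are left untouched, so this is safe to run on mixed content. Do a before and after timing to see whether it helps in your setup.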


Edit: Adding to this, I just had some fun playing with wkhtmltopdf. Currently, on an Intel Centrino 2 with 4GB memory, generating a PDF with 57 pages of content (mixed p, ul, table), ~100 images and a TOC consistently takes < 7 seconds. I'm also running Visual Studio, a browser, an HTTP server and various other software that might slow it down. I use stdin and stdout directly instead of files.
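
For anyone wondering what "stdin and stdout directly" looks like, here is a minimal Python sketch (not my actual code). It relies on wkhtmltopdf accepting "-" for both the input and the output argument; the function name and the command parameter are made up for the example.

```python
import subprocess
from typing import Sequence


def html_to_pdf(
    html: str,
    command: Sequence[str] = ("wkhtmltopdf", "--quiet", "-", "-"),
) -> bytes:
    """Pipe HTML to the converter on stdin and return what it writes to stdout."""
    result = subprocess.run(
        list(command),
        input=html.encode("utf-8"),
        capture_output=True,
        check=True,  # raise CalledProcessError if the converter exits non-zero
    )
    return result.stdout
```

Something like pdf_bytes = html_to_pdf("<h1>Invoice</h1>") then avoids temporary files entirely. The command is a parameter only so that the binary path or extra flags can be swapped in.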


Edit: I have not tried this, but if you have linked CSS, try embedding it in the HTML file (remember to do a before and after test to see the effects properly!). The improvement here most likely depends on things like caching and where the CSS is served from: if it's read from disk every time or, god forbid, regenerated from SCSS, it could be pretty slow, but if the result is cached by the web server (I don't think wkhtmltopdf caches anything between instances) it might not have a big effect. YMMV.
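
If you want to try the CSS embedding, a minimal Python sketch could look like the one below. It is untested against real-world pages; the function name and css_root are hypothetical, and it assumes double-quoted attributes and href values that are relative paths to files on disk.

```python
import re
from pathlib import Path

LINK_TAG = re.compile(r'<link\b[^>]*\brel="stylesheet"[^>]*>', re.IGNORECASE)
HREF = re.compile(r'\bhref="([^"]+)"', re.IGNORECASE)


def inline_stylesheets(html: str, css_root: Path) -> str:
    """Replace <link rel="stylesheet"> tags with <style> blocks read from disk."""

    def replace(match: re.Match) -> str:
        href = HREF.search(match.group(0))
        if href is None:
            return match.group(0)
        css_file = css_root / href.group(1)
        if not css_file.is_file():
            # Remote or missing stylesheets are left exactly as they were.
            return match.group(0)
        return "<style>\n" + css_file.read_text(encoding="utf-8") + "\n</style>"

    return LINK_TAG.sub(replace, html)
```

Run it once per template rather than once per document if the CSS does not change, otherwise you just move the disk read from wkhtmltopdf into your own code.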

Lute answered 24/7, 2012 at 9:6 Comment(3)
PLUS ONE for the images from disk instead of a web server in between. I just tested it and saved 70% of generation time! - Denims
One thing I'd add to this answer is that if you've linked CSS, you should try embedding it in the HTML file. That should also save some time. - Scalariform
Nice tips! I am also using WKHTML via ProcessStart to process around 10 HTML pages (+ customized footers). I have processed just over 22 million PDFs, and they take up to 2 seconds per PDF, which I sometimes feel is a bit much. - Dniester
3

We tried using wkhtmltopdf in several implementations. My documents are huge tables of generated coordinate points; a typical PDF is about 500 pages.

We tried the .NET ports of wkhtmltopdf. The results were:

- Pechkin - Pro: no other application needed. Contra: slow; 500 pages took about 5 minutes to generate.
- PdfCodaxy - Contra only: slow, slower than plain wkhtmltopdf. Requires wkhtmltopdf to be installed. Problems with non-Unicode text.
- Nreco - Contra only: slow, slower than plain wkhtmltopdf. Requires wkhtmltopdf to be installed. Does not unlock its libraries correctly after use (for me).

We also tried invoking the wkhtmltopdf binary from C# code.

Pro: easy to use, faster than the libraries.
Contra: needs temporary files (cannot use Stream objects). Breaks on very large (100MB+) HTML files, just like the other libraries.
Unionize answered 10/9, 2014 at 11:41 Comment(3)
Regarding NReco.PdfGenerator, I have no idea how it can be slower than pure WkHtmlToPdf (internally it invokes WkHtmlToPdf.exe in a separate process). Also, it does NOT require WkHtmlToPdf to be installed: all files are embedded in the DLL and extracted automatically if missing. - Harkey
@VitaliyFedorchenko Perhaps "badma" is re-using a single child process by sending jobs via standard input (--read-args-from-stdin), avoiding the penalty of launching the process, whereas [spitballing here] Nreco is launching the wkhtmltopdf process for every PDF file. - Naughty
@Naughty NReco.PdfGenerator can also use "--read-args-from-stdin" through its "BeginBatch" / "EndBatch" API (note that this API is available only to commercial users with a license key). - Harkey
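
The --read-args-from-stdin idea from the comments above can be sketched in Python roughly as follows: one long-lived wkhtmltopdf process receives one "input output" argument line per job. The function names are made up, and because the quoting rules for these lines are an open question here, the sketch simply refuses paths that contain whitespace.

```python
import subprocess
from typing import Iterable, List, Tuple


def batch_lines(jobs: Iterable[Tuple[str, str]]) -> List[str]:
    """Build one argument line per (html_path, pdf_path) job."""
    lines = []
    for html_path, pdf_path in jobs:
        for path in (html_path, pdf_path):
            # Assumption: no quoting is attempted, so reject risky paths.
            if any(ch.isspace() for ch in path):
                raise ValueError(f"path contains whitespace: {path!r}")
        lines.append(f"{html_path} {pdf_path}\n")
    return lines


def convert_batch(jobs: Iterable[Tuple[str, str]], binary: str = "wkhtmltopdf") -> int:
    """Feed all jobs to a single wkhtmltopdf process and return its exit code."""
    proc = subprocess.Popen(
        [binary, "--read-args-from-stdin"],
        stdin=subprocess.PIPE,
        text=True,
    )
    # communicate() writes the lines, closes stdin and waits for the process.
    proc.communicate("".join(batch_lines(jobs)))
    return proc.returncode
```

Whether this beats one process per file depends on how much of your per-document time is process startup, so time both on a sample of your own documents.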
3

wkhtmltopdf --print-media-type is blazing fast. But you lose normal CSS styling with that.

This may NOT be an ideal solution for exporting complex HTML pages. But it worked for me because my HTML content is pretty simple and in tabular form.

Tested on version wkhtmltopdf 0.12.2.1

Genipap answered 3/3, 2015 at 22:18 Comment(2)
Weirdly, I experienced a degradation of performance when I tried this. It took twice as long for some reason. - Logographic
--print-media-type just ignores CSS styles that are not defined as "print" styles, so it all depends on where you put your styles. I don't get why this would be "blazing fast" other than that. Why is this not ideal for complex HTML? It is all about the CSS it is rendering. - Dysphemia
2

You can create your own pool of wkhtmltopdf engines. I did it for a simple use case by invoking the API directly instead of starting a wkhtmltopdf.exe process every time. The wkhtmltopdf API is not thread-safe, so it's not easy to do. Also, don't forget about sharing the native code between AppDomains.
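
This is not my actual implementation, but the core trick for a non-thread-safe engine can be sketched in a few lines of Python: funnel every call through one dedicated worker thread. The convert and init callables are hypothetical stand-ins for whatever binding to the wkhtmltopdf API you use.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class SingleThreadEngine:
    """Run every conversion on one dedicated thread.

    convert and init are hypothetical callables wrapping a native binding;
    nothing here is part of wkhtmltopdf itself.
    """

    def __init__(
        self,
        convert: Callable[[str], bytes],
        init: Optional[Callable[[], None]] = None,
    ) -> None:
        self._convert = convert
        # max_workers=1 means init and every convert call share one thread.
        self._executor = ThreadPoolExecutor(max_workers=1, initializer=init)

    def submit(self, html: str) -> "Future[bytes]":
        """Queue a conversion; callers on any thread get a Future back."""
        return self._executor.submit(self._convert, html)

    def close(self) -> None:
        """Wait for queued work to finish and stop the worker thread."""
        self._executor.shutdown(wait=True)
```

Request threads call engine.submit(html).result() and never touch the native library themselves. A pool would then be several worker processes, each hosting one such engine, which is my assumption about how to scale it rather than something I have measured.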

Maidamaidan answered 16/3, 2020 at 16:11 Comment(1)
Can you give any sort of code example instead of just a general idea? - Evelunn
