Generation of PDF from HTML with non-Latin characters using ITextRenderer does not work
K

3

16

This is the second day I have spent investigating this with no results. At least I am now able to ask something very specific.

I am trying to render valid HTML that contains some non-Latin characters to a PDF file using iText, and more specifically using ITextRenderer from Flying Saucer.

My short example starts by initializing a string variable doc with this value:

String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
            + "<body>Some greek characters: Καλημέρα Some greek characters"
            + "</body></html>";

Here is the code that I use for debugging purposes. I save this string to an HTML file and then open it in a browser, just to double-check that the HTML content is valid and that I can still read the Greek characters:

//write for debugging purposes in an html file
File newTextFile = new File("C:/work/test.html");
FileWriter fw = new FileWriter(newTextFile);
fw.write(doc);
fw.close();
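One caveat with this debug step: FileWriter encodes with the JVM's default charset, so the file only contains UTF-8 when that default happens to be UTF-8. A minimal sketch of a variant that fixes the encoding explicitly is below. The class name Utf8HtmlDump and the temp-file location are hypothetical, and the Unicode escapes spell the Greek word from the question so that the example does not depend on the source file's encoding.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

final class Utf8HtmlDump {
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    /** Writes text as UTF-8 bytes, independent of the JVM's default charset. */
    static void write(File target, String text) throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(target), UTF_8);
        try {
            out.write(text);
        } finally {
            out.close(); // also flushes the buffered characters
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location; the question writes to C:/work/test.html instead.
        File target = File.createTempFile("test", ".html");
        String greek = "\u039A\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1";
        write(target, "<html><body>" + greek + "</body></html>");
        System.out.println("Wrote " + target.length() + " bytes to " + target);
    }
}
```

The code sticks to APIs that should be available on Java 6, since that is the version listed in the question.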

The next step is to write this value to the PDF file. This is my code:

ITextRenderer renderer = new ITextRenderer();
// add some fonts - if paths are not right, an exception will be thrown
renderer.getFontResolver().addFont("c:/work/fonts/TIMES.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.getFontResolver().addFont("c:/work/fonts/TIMESBD.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.getFontResolver().addFont("c:/work/fonts/TIMESBI.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.getFontResolver().addFont("c:/work/fonts/TIMESI.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setValidating(false);
DocumentBuilder builder = documentBuilderFactory.newDocumentBuilder();
builder.setEntityResolver(FSEntityResolver.instance());
org.w3c.dom.Document document = builder.parse(new ByteArrayInputStream(
        doc.toString().getBytes("UTF-8")));

renderer.setDocument(document, null);
renderer.layout();
renderer.createPDF(os);
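To narrow down where the characters get lost, the parsing step can be checked on its own, without iText or Flying Saucer. The sketch below uses only the JDK's XML parser and reports whether the Greek word is still in the DOM after parsing the UTF-8 bytes. The class name ParseCheck and the helper bodyText are hypothetical; FSEntityResolver is left out because this document has no DOCTYPE to resolve.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParseCheck {

    /** Parses XHTML from its UTF-8 bytes and returns the text of the first body element. */
    static String bodyText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(Charset.forName("UTF-8"))));
        return document.getElementsByTagName("body").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes spell the Greek word used in the question.
        String greek = "\u039A\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1";
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
                + "<body>Some greek characters: " + greek + " Some greek characters"
                + "</body></html>";
        boolean kept = bodyText(doc).contains(greek);
        System.out.println(kept ? "Greek text survived parsing" : "Greek text was lost while parsing");
    }
}
```

If the text survives here, the loss presumably happens later, during layout or font selection, rather than in the encoding of the input.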

The final outcome of my code is:

In HTML file I get: Some greek characters: Καλημέρα Some greek characters (expected)

In PDF file I get: Some greek characters: Some greek characters (unexpected: the Greek characters are dropped!)

Dependencies:

  • java version "1.6.0_27"

  • itext-2.0.8.jar

  • de.huxhorn.lilith.3rdparty.flyingsaucer.core-renderer-8Pre2.jar

I have also experimented with many more fonts, but I guess that my problem has nothing to do with using the wrong fonts. Any help is more than welcome.

Thanks

Keenan answered 20/4, 2012 at 17:13 Comment(0)
P
13

I am from the Czech Republic and had the same problem with our national characters. After some searching, I managed to solve it with this solution.

Specifically with (which you already have):

renderer
    .getFontResolver()
    .addFont(fonts.get(i).getFile().getPath(), 
             BaseFont.IDENTITY_H, 
             BaseFont.NOT_EMBEDDED);

and then the important part, in the CSS:

* {
  font-family: Verdana;
/*  font-family: Times New Roman; - alternative. Without ""! */
}

It seems to me that without that CSS, your registered fonts are not used. When I remove these lines from the CSS, the encoding is broken again.
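Applied to the document from the question, that means adding a head with a style block to the XHTML string. A minimal sketch follows; the class StyledXhtml and its wrap method are hypothetical helpers, and it assumes that the family name in the CSS has to match the family name of a font registered with addFont.

```java
final class StyledXhtml {

    /** Builds an XHTML document whose elements all use the given CSS font family. */
    static String wrap(String fontFamily, String bodyHtml) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
                + "<head><style type=\"text/css\">"
                + "* { font-family: " + fontFamily + "; }"
                + "</style></head>"
                + "<body>" + bodyHtml + "</body></html>";
    }

    public static void main(String[] args) {
        // Unicode escapes spell the Greek word used in the question.
        String greek = "\u039A\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1";
        String doc = wrap("Verdana", "Some greek characters: " + greek);
        // The resulting string would be parsed and handed to the renderer as before.
        System.out.println(doc);
    }
}
```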

Hope this will help!

Poundfoolish answered 9/7, 2012 at 16:20 Comment(4)
Thank you for the correct solution! Specifying the font (in my case it was DejaVu Serif) worked! - Morez
Is this an OS-agnostic solution? - Corrasion
Thank you, it works for Turkish characters too. ITextRenderer renderer = new ITextRenderer(); String fontpath = "fonts/arial.ttf".replace("/", File.separator); renderer.getFontResolver().addFont(fontpath, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED); renderer.setDocument(url); renderer.layout(); renderer.createPDF(os, true); - Verdellverderer
If anyone is struggling with this issue: just for fun, I tried loading the Arial font with the encoding 'windows-1250', and it seems to work :) fontResolver.addFont(fontResource.getUrl().getPath(), "windows-1250", true) Hope this saves someone's day. - Spillman
T
9

Add to your HTML something like this:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
        <style type='text/css'> 
            * { font-family: 'Arial Unicode MS'; }
        </style>
    </head>
    <body>
        <span>Some text with šđčćž characters</span>
    </body>
</html>

and then register the font with the ITextRenderer's FontResolver in the Java code:

ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont("fonts/ARIALUNI.TTF", BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);

This works great for Croatian characters.

The jars used for generating the PDF are:

core-renderer.jar
iText-2.0.8.jar
Tumular answered 2/12, 2013 at 10:1 Comment(2)
Thanks. It works for Turkish characters too. The only difference in my case is the encoding parameter: renderer.getFontResolver().addFont("D:\\Fonts\\arial\\arial.ttf", "Cp1254", BaseFont.EMBEDDED); and in the HTML, <style type='text/css'> * { font-family: 'Arial'; } </style> - Verdellverderer
Is there an OS-agnostic solution out there? - Corrasion
A
0

Let iText read from a header in your HTML that the content is UTF-8.
Add a meta tag for the content type with a UTF-8 charset to the HTML, then run iText to generate the PDF and check the result.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 </head>
 <body>
  Some greek characters: Καλημέρα Some greek characters
 </body>
</html>

Update:
If the above is not working, then refer to ENCODING VERSUS THE DEFAULT CHARSET USED BY THE JVM in the document published at http://www.manning.com/lowagie2/iText2E_MEAP_CH02.pdf
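
As a rough illustration of why the charset matters (this is not taken from the linked chapter), the sketch below round-trips the Greek word through two charsets. With UTF-8 the text comes back unchanged. With ISO-8859-1 the Greek letters cannot be represented and are replaced during encoding. The class name CharsetRoundTrip and the helper survives are hypothetical.

```java
import java.nio.charset.Charset;

final class CharsetRoundTrip {

    /** Returns true if text is unchanged after encoding to and decoding from the named charset. */
    static boolean survives(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        // getBytes replaces characters the charset cannot represent.
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset).equals(text);
    }

    public static void main(String[] args) {
        // Unicode escapes spell the Greek word used in the question.
        String greek = "\u039A\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1";
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 keeps the text: " + survives(greek, "UTF-8"));
        System.out.println("ISO-8859-1 keeps the text: " + survives(greek, "ISO-8859-1"));
    }
}
```

Calls such as getBytes() or new FileWriter(file) that do not name a charset fall back to the JVM default printed on the first line, which is why passing the charset explicitly is the safer habit.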

Airframe answered 20/4, 2012 at 19:29 Comment(4)
Just tried that, with no good news :( I am getting the same result. @Ravinder I think you missed a </meta> in your example :P - Keenan
I added this to my test: System.out.println("file.encoding=" + System.getProperty("file.encoding")); which prints: file.encoding=UTF-8. Should this be enough to ensure that I have the right encoding? - Keenan
@alexandros: Not sure, but I suggest you look into the article I referred to, written by Bruno Lowagie, the original developer of iText. - Airframe
Thanks, I will have a look and let you know if I find a solution. - Keenan
