transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working
I have the following method to write an XMLDom to a stream:

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    transformer.transform(docSource, new StreamResult(out));
}

I am testing some other XML functionality, and this is just the method that I use to write to a file. My test program generates 33 test cases where files are written out. 28 of them have the following header:

<?xml version="1.0" encoding="UTF-8"?>...

But for some reason, 1 of the test cases now produces:

<?xml version="1.0" encoding="ISO-8859-1"?>...

And four more produce:

<?xml version="1.0" encoding="Windows-1252"?>...

As you can clearly see, I am setting the ENCODING output key to UTF-8. These tests used to work on an earlier version of Java. I have not run the tests in a while (more than a year), but running them today on "Java(TM) SE Runtime Environment (build 1.6.0_22-b04)" I get this funny behavior.

I have verified that the documents causing the problem were read from files that originally had those encodings. It seems that the newer versions of the libraries attempt to preserve the encoding of the source file that was read. But that is not what I want ... I really do want the output to be in UTF-8.

Does anyone know of any other factor that might cause the transformer to ignore the UTF-8 encoding setting? Is there anything else that has to be set on the document to say to forget the encoding of the file that was originally read?
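The retained encoding can be seen in isolation: per DOM Level 3, the parser records the declared encoding of the source on the Document (getXmlEncoding), and the comments below suggest the affected serializers copy that value over the ENCODING output property. A minimal probe (class and method names are mine):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class EncodingProbe {
    /** Parse the bytes and report the encoding recorded on the DOM Document. */
    static String declaredEncoding(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml))
                .getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        // A document whose prolog declares ISO-8859-1
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(xml)); // ISO-8859-1
    }
}
```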

UPDATE:

I checked the same project out on another machine, built it, and ran the tests there. On that machine all the tests pass! All the files have "UTF-8" in their header. That machine has "Java(TM) SE Runtime Environment (build 1.6.0_29-b11)". Both machines are running Windows 7. On the new machine that works correctly, jdk1.5.0_11 is used to make the build, but on the old machine jdk1.6.0_26 is used to make the build. The libraries used for both builds are exactly the same. Can it be a JDK 1.6 incompatibility with 1.5 at build time?

UPDATE:

After 4.5 years, the Java library is still broken, but thanks to the suggestion by Vyrx below, I finally have a proper solution!

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>".getBytes("UTF-8"));
    transformer.transform(docSource, new StreamResult(out));
}

The solution is to disable the writing of the header and to write the correct header just before serializing the XML to the output stream. Lame, but it produces the correct results. Tests broken over 4 years ago are now running again!
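A self-contained round-trip check of this workaround (class name and test data are mine, and the method is made static for the demo): it parses a document declared as ISO-8859-1, the problematic case, and verifies that both the header and the body come out in UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class HeaderWorkaroundTest {
    static void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
        fDoc.setXmlStandalone(true);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        // Write the header by hand, then serialize the document without one
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>".getBytes(StandardCharsets.UTF_8));
        transformer.transform(new DOMSource(fDoc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        // Source document declares ISO-8859-1 and contains a non-ASCII character
        byte[] src = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>\u00e9</root>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(src));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeToOutputStream(doc, out);
        String result = out.toString("UTF-8");
        System.out.println(result.startsWith("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
        System.out.println(result.contains("\u00e9"));
    }
}
```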

Stavropol answered 23/3, 2013 at 21:2 Comment(6)
There are several places to check for your Locale. Your local computer has a locale, your IDE might have a Locale, and your JVM process has a Locale. I've seen issues like this before when my Locales were changing. How are you running the tests? java.exe, maven, IDE?Matlock
As I have specified UTF-8 directly, the locale should not matter, but to answer your question directly, the test code is invoked as a command line call to Java.exe, on a windows system, located on the pacific coast of USA, and configured for US English and Pacific timezone.Stavropol
It could possibly be something to do with java 1.6, there are some similar kind of bugs reported bugs.sun.com/bugdatabase/view_bug.do?bug_id=4504745Glaikit
I can't confirm that. I just tested with jdk1.8.0_91 and it still failed. So I upgraded to the latest jdk1.8.0_181 and it still fails. Even though I had specified the output encoding to be UTF-8, and even though the output actually is encoded in UTF-8, the HEADER is declared to be ISO-8859-1. I think I will stick with the work-around.Stavropol
Sounds like this Java bug: bugs.openjdk.org/browse/JDK-8227616Scoter
@joe23 I should try again and see if that fixed it, but so long ago, I no longer have access to that source code. :-(Stavropol

I had the same problem on Android when serializing emoji characters. When using UTF-8 encoding in the transformer, the output contained numeric character references for the UTF-16 surrogate pairs, which would subsequently break other parsers that read the data.

This is how I ended up solving it:

StringWriter sw = new StringWriter();
sw.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>");
Transformer t = TransformerFactory.newInstance().newTransformer();

// this will work because we are creating a Java string, not writing to an output
t.setOutputProperty(OutputKeys.ENCODING, "UTF-16"); 
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.transform(new DOMSource(elementNode), new StreamResult(sw));

return IOUtils.toInputStream(sw.toString(), Charset.forName("UTF-8"));
Tyrannize answered 6/12, 2017 at 21:31 Comment(2)
Yes, that looks like it works. I am NOT a fan of converting my entire XML tree to a string in memory (particularly given that StringWriter is not efficient at it). I really insist on streaming directly to the output. A possible solution is instead of adding the header after serialization, to write the header to the output stream BEFORE serializing the XML without a header to the same output stream. I will see if that works.Stavropol
I have rewritten this idea to properly use streams, giving you the credit for the answer (thanks!). As you wrote it, you would have three copies of the document in memory at the same time. For small XML that's not a problem, but in general having three copies of an important data file in memory is not efficient. A better approach is to simply write the header before serializing the XML to the writer. I rewrote your answer so that only 2 copies of the XML are in memory.Stavropol

To answer the question: the following code works for me. It takes data in one encoding (input_encoding) and re-serializes it in another (output_encoding).

ByteArrayInputStream inStreamXMLElement = new ByteArrayInputStream(strXMLElement.getBytes(input_encoding));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document docRepeat = db.parse(new InputSource(new InputStreamReader(inStreamXMLElement, input_encoding)));
Node elementNode = docRepeat.getElementsByTagName(strRepeat).item(0);

DOMSource domSourceRepeat = new DOMSource(elementNode);
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, output_encoding);

ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, output_encoding));

transformer.transform(domSourceRepeat, sr);
byte[] outputBytes = bos.toByteArray();
strRepeatString = new String(outputBytes, output_encoding);
Tonsorial answered 16/4, 2014 at 18:3 Comment(1)
The error occurs only on some versions of Java. I have not had time to run a full investigation of exactly what environment causes the problem, nor even time to post the test code here, however it is substantially similar to what you post. What was failing was automated tests that had run for years. The code you included looks like a good example of how to test for the problem. I don't know whether I will be able to go back to the original environment that was failing and re-run the tests there. All, in the fullness of time...Stavropol

I've spent a significant amount of time debugging this issue because it worked well on my machine (Ubuntu 14 + Java 1.8.0_45) but wasn't working properly in production (Alpine Linux + Java 1.7).

Contrary to my expectation, the following suggestion from the answer above didn't help:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, "UTF-8"));

but this one worked as expected:

val out = new StringWriter()
val result = new StreamResult(out)
Twilley answered 23/10, 2015 at 13:40 Comment(0)

I could work around the problem by wrapping the Document object passed to the DOMSource constructor. The method getXmlEncoding of my wrapper always returns null, all other methods are delegated to the wrapped Document object.
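The answer doesn't include code; a minimal sketch of such a wrapper (class and method names are mine) using java.lang.reflect.Proxy, so only getXmlEncoding needs special-casing while every other Document method is delegated:

```java
import java.lang.reflect.Proxy;
import org.w3c.dom.Document;

public class EncodingHidingWrapper {
    /**
     * Returns a Document view that reports no XML encoding; all other calls
     * are delegated to the wrapped document. Passing the wrapper to DOMSource
     * is intended to keep the transformer from copying the source file's
     * encoding into the output.
     */
    public static Document hideXmlEncoding(final Document doc) {
        return (Document) Proxy.newProxyInstance(
                Document.class.getClassLoader(),
                new Class<?>[] { Document.class },
                (proxy, method, args) ->
                        "getXmlEncoding".equals(method.getName())
                                ? null
                                : method.invoke(doc, args));
    }
}
```

Then pass new DOMSource(hideXmlEncoding(doc)) to the transformer instead of new DOMSource(doc).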

Anomalous answered 6/7, 2016 at 20:12 Comment(1)
This is IMHO the "cleanest" solution since it addresses directly the TransformerFactory bug: the implementation of TrAX overwrites the specified encoding with the one taken from getXmlEncoding.Hydrogenolysis

What about this?

public static String documentToString(Document doc) throws Exception {
    return documentToString(doc, "UTF-8");
}

public static String documentToString(Document doc, String encoding) throws Exception {
    if ("".equals(validateNullString(encoding))) encoding = "UTF-8";

    Transformer transformer;
    try {
        transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
    } catch (javax.xml.transform.TransformerConfigurationException error) {
        return null;
    }

    Source source = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    Result result = new StreamResult(writer);

    try {
        transformer.transform(source, result);
    } catch (javax.xml.transform.TransformerException error) {
        return null;
    }
    return writer.toString();
}
Berman answered 27/11, 2014 at 11:29 Comment(0)

Use Saxon's TransformerFactoryImpl (Saxon-HE >= 10.3):

public void writeToStream(Document doc, OutputStream output) throws TransformerException, IOException {
    TransformerFactory transformerFactory =
        TransformerFactory.newInstance("net.sf.saxon.TransformerFactoryImpl", null);
    transformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
    transformerFactory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
    transformerFactory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
    Transformer transformer = transformerFactory.newTransformer();
    DOMSource source = new DOMSource(doc);
    transformer.transform(source, new StreamResult(output));
}

This solved the issue on my side.

Phoenicia answered 23/2, 2021 at 8:29 Comment(0)

I'm taking a wild shot here, but you mention that you are reading files for the data of the tests. Can you make sure that you read the files using the proper encoding, so that when you write into your OutputStream you already have the data in the proper encoding?

So, something like new InputStreamReader(new FileInputStream(fileDir), "UTF-8").

Don't forget that the single-argument constructors of FileReader always use the platform default encoding: "The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate."

Johnsen answered 11/9, 2013 at 13:41 Comment(1)
I never use FileReader. --- The DOM "Document" uses character string values which means they have already been converted from their original form. I am using the Java DOM utilities to read the file directly from the byte stream. The stream is expected to be interpreted according to the XML header that specifies encoding. This is how XML works. --- The file appears to be read correctly, and it is written in the encoding specified -- just not the encoding that I requested that it write in.Stavropol

Try setting the encoding on your StreamResult specifically:

StreamResult result = new StreamResult(new OutputStreamWriter(out, "UTF-8"));

This way, it should only be able to write out in UTF-8.

Asuncionasunder answered 4/11, 2014 at 4:39 Comment(1)
The problem is that the 'header' is incorrect. If the header says that it is ISO-8859-1 then I would not want it to be actually encoded in some other way. I need both the header and the actual encoding of the stream. That is why with these libraries I always use input/output streams and not reader/writer ... because the standard says that you have to read the header to find out what the encoding is.Stavropol