How does Apache commons IO convert my XML header from UTF-8 to UTF-16?

Asked 16/2, 2015 at 17:4 Answered 17/2, 2015 at 19:24

Solved java utf-8 apache-commons utf-16 document-conversion

I’m using Java 6. I have an XML template, which begins like so

<?xml version="1.0" encoding="UTF-8"?>

However, I notice when I parse and output it with the following code (using Apache Commons-io 2.4) …

    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream("my-template.xml");

    try
    {
        byte[] data = org.apache.commons.io.IOUtils.toByteArray( in );
        InputSource src = new InputSource(new StringReader(new String(data)));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(src);
    }
    finally
    {
        in.close();
    }

The first line is output as

<?xml version="1.0" encoding="UTF-16"?>

What do I need to do when parsing/outputting the file so that the header encoding will remain “UTF-8”?

Edit: Per the suggestion given, I changed my code to

    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream(name);

    try
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(in);
    }
    finally
    {
        in.close();
    }

But even though the first line of my template file is

<?xml version="1.0" encoding="UTF-8"?>

when I output the document as a String, it produces

<?xml version="1.0" encoding="UTF-16"?>

as the first line. Here's what I use to output the "doc" object as a string ...

private String getDocumentString(Document doc)
{
    DOMImplementationLS domImplementation = (DOMImplementationLS)doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    return lsSerializer.writeToString(doc);  
}

Rafaelle answered 16/2, 2015 at 17:4 Comment(0)

Turns out that when I changed my Document -> String method to

private String getDocumentString(Document doc)
{
    String ret = null;
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer;
    try
    {
        transformer = tf.newTransformer();
        transformer.transform(domSource, result);
        ret = writer.toString();
    }
    catch (TransformerConfigurationException e)
    {
        e.printStackTrace();
    }
    catch (TransformerException e)
    {
        e.printStackTrace();
    }
    return ret;
}

the 'encoding="UTF-8"' headers no longer got output as 'encoding="UTF-16"'.
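A likely explanation for the difference: LSSerializer.writeToString returns a DOMString, which the DOM Level 3 Load and Save specification defines as UTF-16, so the declaration it writes says UTF-16. A Transformer defaults to UTF-8 instead. The sketch below is a variant of the method above that states the encoding explicitly rather than relying on that default. The class name TransformerIO and the sample "foo" element are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical class name, used only for this illustration.
public class TransformerIO {

    // Serializes the document to a String whose XML declaration names UTF-8.
    public static String getDocumentString(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Request the encoding explicitly instead of depending on the default.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // Make sure the XML declaration is written.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");

        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny sample document instead of loading the real template.
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(getDocumentString(doc));
    }
}
```

Unlike the method above, this version lets TransformerException propagate to the caller rather than printing the stack trace and returning null. Note that the declaration only describes the intended encoding: the String itself still has to be encoded as UTF-8 when it is finally written out as bytes.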

Rafaelle answered 17/2, 2015 at 19:24 Comment(0)

new StringReader(new String(data))

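This is wrong: new String(data) decodes the bytes using the platform default charset rather than the encoding declared in the document. Let the parser detect the document encoding by using, for example, DocumentBuilder.parse(InputStream):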

doc = builder.parse(in);

What encoding the DOM is serialized to depends on how you write it. The in-memory DOM has no concept of encoding.

Writing the document to a string with a UTF-8 declaration:

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.*;

public class DomIO {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(getDocumentString(doc));
    }

    public static String getDocumentString(Document doc) {
        DOMImplementationLS domImplementation = (DOMImplementationLS) 
                                                 doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();
        LSOutput lsOut = domImplementation.createLSOutput();
        lsOut.setEncoding("UTF-8");
        lsOut.setCharacterStream(new StringWriter());
        lsSerializer.write(doc, lsOut);
        return lsOut.getCharacterStream().toString();
    }
}

The LSOutput also has binary stream support if you want the serializer to encode the document correctly on output.
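As a rough sketch of the binary stream route, the variant below hands the LSOutput a byte stream instead of a character stream, so the serializer both writes the UTF-8 declaration and encodes the bytes itself. The class name DomBytes and the method getDocumentBytes are invented for this example. It assumes the DOM implementation supports the LS feature, as the code above already does.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical class name, used only for this illustration.
public class DomBytes {

    // Serializes the document to UTF-8 encoded bytes.
    public static byte[] getDocumentBytes(Document doc) {
        DOMImplementationLS domImplementation =
                (DOMImplementationLS) doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();
        LSOutput lsOut = domImplementation.createLSOutput();

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        lsOut.setEncoding("UTF-8");  // encoding named in the declaration and used for the bytes
        lsOut.setByteStream(bytes);  // byte stream, so the serializer does the encoding
        lsSerializer.write(doc, lsOut);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        doc.appendChild(doc.createElement("foo"));
        // Decode with the same charset only to display the result.
        System.out.println(new String(getDocumentBytes(doc), "UTF-8"));
    }
}
```

The resulting byte array can be written to a file or socket as-is, which keeps the declared encoding and the actual byte encoding in agreement.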

Halophyte answered 16/2, 2015 at 17:10 Comment(1)

Hi, thanks. Despite implementing this, the first line of my document template is still being output with the "UTF-16" encoding header instead of the "UTF-8" one. Maybe it's the way I'm converting the Document object to a String, which I edited my question to include. – Rafaelle 16/2, 2015 at 17:49

