Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

I am parsing XML in Java that contains numeric character references and entity references such as (but not limited to) &#10; &#13; &lt; &gt; (line feed, carriage return, <, >). While parsing, I append the text content of nodes to a StringBuilder so that I can later write it out to a text file.

However, these references are resolved into the characters they stand for (newlines/whitespace) when I write the String to a file or print it out.

How can I keep the original numeric character references when iterating over the nodes of an XML file in Java and storing their text content in a String?

Example of demo xml file:

<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">    
    <Field attributeWithChar="A string followed by special symbols &#13;  &#10;" />
</ABCD>

Example Java code. It loads the XML, iterates over the nodes and appends the value of each attribute to a StringBuilder. After the iteration is over, it writes the collected text to the console and also to a file (but without the &#10; &#13; symbols).

What would be a way to keep these symbols when storing them in a String? Could you please help me? Thank you.

public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {   
    DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
    Document document = null;
    DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
    document = documentBuilder.parse(new File("path/to/demo.xml"));
    StringBuilder sb = new StringBuilder();

    NodeList nodeList = document.getElementsByTagName("*");
    for (int i = 0; i < nodeList.getLength(); i++) {
        Node node = nodeList.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            NamedNodeMap nnp = node.getAttributes();
            for (int j = 0; j < nnp.getLength(); j++) {
                sb.append(nnp.item(j).getTextContent());
            }
        }
    }
    System.out.println(sb.toString());

    try (Writer writer = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream("path/to/demo_output.xml"), "UTF-8"))) {
        writer.write(sb.toString());
    }
}
Pean answered 18/3, 2015 at 16:36 Comment(0)

You need to escape all the entity and character references before parsing the file into a Document. You do that by replacing the ampersand & itself with its corresponding XML entity &amp;, so that the parser treats each reference as literal text. Something like,

DocumentBuilder documentBuilder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();

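String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");

// Escape every ampersand so that "&#13;" reaches the DOM as literal text.
// replace() is enough here; no regular expression is needed.
Document document = documentBuilder.parse(
        new InputSource(new StringReader(xmlContents.replace("&", "&amp;"))));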

Output :

2A string followed by special symbols &#13;  &#10;
Tobar answered 18/3, 2015 at 17:32 Comment(3)
Thank you very much, this works. An additional question: I tried to solve this by simply adding documentFactory.setExpandEntityReferences(false); which, from what I understand, should not expand these special symbols at all, but it did not change the output. Do you know why? - Pean
Numeric character references are not entity references, though they use the same &; delimiters. As far as XML is concerned, there is no difference between a numeric character reference and the Unicode character it refers to. An alternative to escaping every ampersand would be to use a <![CDATA[]]> section ... but generally EITHER of these means you're trying to solve the wrong problem, and should instead be asking yourself why normal XML markup can't be used. - Classified
@Pean Keshlam is correct. The technical term for &#nnnn; is numeric character reference (NCR). Only &quot;, &amp;, &apos;, &lt; and &gt; are considered XML character entities (more info at en.wikipedia.org/wiki/Character_entity_reference). So, perhaps this answers why setExpandEntityReferences(false) didn't have any effect. But fortunately, escaping the ampersand works for both of them. - Tobar
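To illustrate the point made in these comments, here is a minimal sketch showing that a numeric character reference in an attribute value is resolved by the parser even when setExpandEntityReferences(false) is set. The class name NcrDemo, the helper attributeValue and the tiny inline document are made up for this illustration and are not part of the original question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NcrDemo {

    // Hypothetical helper: parses the XML string and returns the value of
    // attribute "b" on the root element.
    static String attributeValue(String xml, boolean expandEntityReferences) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setExpandEntityReferences(expandEntityReferences);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("b");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a b=\"x&#10;y\"/>";
        String value = attributeValue(xml, false);

        // The reference has already been replaced by a real line feed character.
        System.out.println("contains line feed: " + value.contains("\n"));
        System.out.println("contains &#10;: " + value.contains("&#10;"));
    }
}
```

Once the document is in the DOM, the information that the line feed was originally written as &#10; is gone, which is why the ampersand has to be escaped before parsing if the literal text is needed.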

P.S. This complements Ravi Thapliyal's answer; it is not an alternative.

I have the same problem with an XML file exported from an Excel 2003 format spreadsheet. This XML file stores line breaks in text content as &#10;, along with other numeric character references. However, after reading it with the Java DOM parser, manipulating the content of some elements and transforming it back to an XML file, I see that all the numeric character references are expanded (i.e. the line break is converted to CRLF) on Windows with J2SE 1.6. Since my goal is to keep the content format as unchanged as possible while manipulating some elements (i.e. retain the numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.

When writing the XML content back to the file, it is necessary to replace all &amp; with & again, right? To do that, I had to give the transformer a StringWriter as its StreamResult, obtain the String from it, replace all occurrences and write the string to the XML file.

TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);

//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);

t.transform(source, result);

//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&amp;" back to "&" to retain numeric character references.
xmlContent = xmlContent.replace("&amp;", "&");

//try-with-resources closes the writer even if write() throws.
try (BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"))) {
    wr.write(xmlContent);
}
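The two answers combined can be sketched as one self-contained round trip that works on strings instead of files. This is only an illustration under assumptions: the class name NcrRoundTrip and the method roundTrip are hypothetical, the XML declaration is omitted to keep the output short, and no DOM manipulation is actually performed in between.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NcrRoundTrip {

    // Hypothetical helper: escape, parse, serialize, unescape.
    static String roundTrip(String xml) throws Exception {
        // 1. Escape every ampersand so references survive parsing as literal text.
        String escaped = xml.replace("&", "&amp;");

        // 2. Parse the escaped text into a DOM.
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escaped)));

        // (Any DOM manipulation would happen here.)

        // 3. Serialize the DOM to a string; the serializer writes "&" as "&amp;".
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));

        // 4. Undo the escaping from step 1.
        return out.toString().replace("&amp;", "&");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ABCD version=\"2\"><Field attributeWithChar=\"A &#13; &#10;\"/></ABCD>";
        System.out.println(roundTrip(xml));
    }
}
```

One caveat with this approach: step 4 also unescapes any ampersand that was added to the DOM during manipulation, so new text containing a plain & would end up as invalid XML in the output.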
Grub answered 23/3, 2015 at 7:20 Comment(0)
