Handling change in newlines by XML transformation for CDATA from Java 8 to Java 11

Java 9 changed the way javax.xml.transform.Transformer with OutputKeys.INDENT handles CDATA sections. In short, in Java 8 an element named 'test' containing some character data would result in:

<test><![CDATA[data]]></test>

But with Java 9 the same input results in:

<test>
    <![CDATA[data]]>
</test>

This is not the same XML: the text content of the element now includes the added whitespace.

I understood (from a source that is no longer available) that for Java 9 there was a workaround using a DocumentBuilderFactory with setIgnoringElementContentWhitespace(true), but this no longer works in Java 11.

Does anyone know a way to deal with this in Java 11? I'm either looking for a way to prevent the extra newlines (but still be able to format my XML), or be able to ignore them when parsing the XML (preferably using SAX).

Unfortunately, I don't know what the CDATA section will actually contain in my application. It might begin or end with whitespace or newlines, so I can't just strip them when reading the XML or when setting the value in the resulting object.

Sample program to demonstrate the issue:

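import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataIndentDemo
{
    public static void main(String[] args) throws TransformerException, ParserConfigurationException, IOException, SAXException
    {
        String data = "data";

        StreamSource source = new StreamSource(new StringReader("<foo><bar><![CDATA[" + data + "]]></bar></foo>"));
        StreamResult result = new StreamResult(new StringWriter());

        // Identity transformation with pretty-printing enabled
        Transformer tform = TransformerFactory.newInstance().newTransformer();
        tform.setOutputProperty(OutputKeys.INDENT, "yes");
        tform.transform(source, result);

        String xml = result.getWriter().toString();

        System.out.println(xml); // I expect bar and CDATA to be on same line. This is true for Java 8, false for Java 11

        // Parse the formatted XML again and read back the content of <bar>
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        String resultData = document.getElementsByTagName("bar")
            .item(0)
            .getTextContent();

        System.out.println(data.equals(resultData)); // True for Java 8, false for Java 11
    }
}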

EDIT: For future reference, I've submitted a bug report to Oracle, and this is fixed in Java 14: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8223291

Elijah answered 25/4, 2019 at 15:51 Comment(1)
You should edit your question and add sample Java code that demonstrates the problem (generate a small XML + transform). It is a lot easier to start with a working example.Connect

As your code relies on unspecified behavior, extra explicit code seems better:

  • You want indentation like:

      tform.setOutputProperty(OutputKeys.INDENT, "yes");
      tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
    
  • However, not for elements containing a CDATA section:

      String xml = result.getWriter().toString();
      // No indentation (whitespace) for elements with a CDATA section.
      xml = xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
    

The regex uses:

  • (?s) DOTALL to have . match any character, including newline characters.
  • .*? the shortest matching sequence (a reluctant quantifier), so that a match ends at the first ]]> instead of spanning "...]]>...]]>".
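
A minimal, self-contained sketch of the regex workaround above, applied to indented output shaped like what Java 11 produces. The class name CdataIndentFix and the method name unindentCdata are hypothetical, chosen only for illustration; the pattern itself is the one from this answer.

```java
import java.util.regex.Pattern;

final class CdataIndentFix {

    // (?s) turns on DOTALL so that . also matches line terminators inside the CDATA section.
    // .*? is reluctant, so the captured group ends at the first "]]>" that is followed by "</".
    private static final Pattern INDENTED_CDATA =
            Pattern.compile("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</");

    /** Removes the whitespace between an element's tags and its CDATA section. */
    static String unindentCdata(String xml) {
        return INDENTED_CDATA.matcher(xml).replaceAll(">$1</");
    }

    public static void main(String[] args) {
        String indented = "<foo>\n    <bar>\n        <![CDATA[data]]>\n    </bar>\n</foo>\n";
        System.out.println(unindentCdata(indented));
    }
}
```

Whitespace that was genuinely present around a CDATA section in the source document cannot be told apart from indentation here, so it is removed as well.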

Alternatively, in a DOM tree (parsed so that CDATA sections are preserved), you can retrieve all CDATA sections via XPath and remove their whitespace-only siblings through the parent element.
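
A rough sketch of the DOM variant, under a few assumptions. It uses a plain recursive tree walk instead of XPath, because the XPath data model does not distinguish CDATA sections from ordinary text. It also assumes that any whitespace-only text node next to a CDATA section is indentation. CdataSiblingCleaner, stripWhitespaceAroundCdata, and textOf are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class CdataSiblingCleaner {

    /** Recursively removes whitespace-only text nodes from every node that has a CDATA child. */
    static void stripWhitespaceAroundCdata(Node parent) {
        boolean hasCdata = false;
        List<Node> blankTextNodes = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            switch (child.getNodeType()) {
                case Node.CDATA_SECTION_NODE:
                    hasCdata = true;
                    break;
                case Node.TEXT_NODE:
                    if (child.getNodeValue().trim().isEmpty()) {
                        blankTextNodes.add(child);
                    }
                    break;
                case Node.ELEMENT_NODE:
                    stripWhitespaceAroundCdata(child);
                    break;
                default:
                    break;
            }
        }
        // Remove only after the loop, so the sibling chain is not modified while iterating.
        if (hasCdata) {
            for (Node blank : blankTextNodes) {
                parent.removeChild(blank);
            }
        }
    }

    /** Parses xml, cleans it, and returns the text content of the first element named tagName. */
    static String textOf(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(false); // keep CDATA sections as separate nodes (the default)
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            stripWhitespaceAroundCdata(doc.getDocumentElement());
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String indented = "<foo>\n    <bar>\n        <![CDATA[data]]>\n    </bar>\n</foo>";
        System.out.println("data".equals(textOf(indented, "bar")));
    }
}
```

Whitespace inside the CDATA section itself is left untouched, so values that begin or end with whitespace survive the round trip.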

Aero answered 29/4, 2019 at 8:48 Comment(6)
Thanks! That's actually a pretty clean workaround. I am wondering what you mean by my code relying on unspecified behavior?Elijah
You are telling the transformation to pretty-print, i.e. to indent every element. The newest Java version does exactly that: it indents CDATA sections as well. So that reeks of an earlier exception made for CDATA. In any case, one cannot find fault with the specification.Aero
Well, CDATA can be followed by 'normal' data. For example, this is valid: <test><![CDATA[data]]>foo</test>. By adding additional whitespace, the contents of the XML change. So I do think this is an issue with the Transformer.Elijah
Then why INDENT=yes? One can restrict the allowed content in a DTD/XSD, but I do not think that plays a role here (nor does validation in general). Would INDENT="no" not suffice if you are reading it into a DOM afterwards?Aero
The issue with CDATA has been fixed in Java 14. I tested it with the EA version: openjdk version "14-ea" 2020-03-17 OpenJDK Runtime Environment (build 14-ea+6-171)Noodlehead
Verified that it indeed works with the ea version of OpenJDK 14. Thanks!Elijah

The solution from Joop Eggen is brilliant.

I just want to expand the solution a little bit.

xml = xml.replaceAll(">\\s*(<\\!\\[CDATA\\[(.|\\n|\\r\\n)*?]\\]>)\\s*</", ">$1</");

This regex allows for newlines inside the CDATA section: it tests for \n as well as Windows-style \r\n.

XML Example:

<test>
   <![CDATA[com.foo.test]]>
</test>
<test>
 <![CDATA[1st Line   
2nd Line]]>
</test>
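
A small comparison sketch, using the hypothetical names CdataRegexCompare, withDotAll, and withAlternation, that runs both regex variants on a multi-line CDATA section. For \n and \r\n line endings they should give the same result. They differ for other line terminators such as a lone \r, which . does not match without (?s) and which the alternation does not list.

```java
final class CdataRegexCompare {

    /** Variant from the first answer: DOTALL flag plus a reluctant .*? */
    static String withDotAll(String xml) {
        return xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
    }

    /** Variant from this answer: explicit alternation for \n and \r\n. */
    static String withAlternation(String xml) {
        return xml.replaceAll(">\\s*(<\\!\\[CDATA\\[(.|\\n|\\r\\n)*?]\\]>)\\s*</", ">$1</");
    }

    public static void main(String[] args) {
        String xml = "<test>\n <![CDATA[1st Line   \r\n2nd Line]]>\n</test>";
        // Both variants strip the indentation around the CDATA section for this input.
        System.out.println(withDotAll(xml).equals(withAlternation(xml)));
    }
}
```

An alternation inside a repeated group also makes Java's regex engine recurse once per matched character, so for very large CDATA sections the (?s) form is probably the safer choice.";
String expected = "<test><![CDATA[1st Line   \r\n2nd Line]]></test>";
if (!CdataRegexCompare.withDotAll(crlf).equals(expected)) throw new AssertionError("dotall crlf");
if (!CdataRegexCompare.withAlternation(crlf).equals(expected)) throw new AssertionError("alternation crlf");
String loneCr = "<t>\n<![CDATA[a\rb]]>\n</t>";
if (!CdataRegexCompare.withDotAll(loneCr).equals("<t><![CDATA[a\rb]]></t>")) throw new AssertionError("dotall lone cr");
if (!CdataRegexCompare.withAlternation(loneCr).equals(loneCr)) throw new AssertionError("alternation lone cr should not match");
</test>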
Spiers answered 25/2, 2023 at 22:51 Comment(2)
Joop Eggen mentions prefixing the regex with (?s) to make .* match newlines. While he did not actually include it in the regex in his answer, I think I used it to solve my problem at the time.Elijah
I have edited Joop Eggen's answer to include the (?s) in the regex, I'll leave it up to future readers to decide which regex they prefer to use :)Elijah
