Handling change in newlines by XML transformation for CDATA from Java 8 to Java 11

Asked 25/4, 2019 at 15:51 Answered 25/2, 2023 at 22:51

Solved java xml transformation sax java-11

With Java 9 there was a change in the way javax.xml.transform.Transformer with OutputKeys.INDENT handles CDATA tags. In short, in Java 8 a tag named 'test' containing some character data would result in:

<test><![CDATA[data]]></test>

But with Java 9 the same results in

<test>
    <![CDATA[data]]>
</test>

Which is not the same XML.

I understood (from a source no longer available) that for Java 9 there was a workaround: using a DocumentBuilderFactory with setIgnoringElementContentWhitespace(true). This no longer works for Java 11.

Does anyone know a way to deal with this in Java 11? I'm either looking for a way to prevent the extra newlines (but still be able to format my XML), or be able to ignore them when parsing the XML (preferably using SAX).

Unfortunately I don't know what the CDATA tag will actually contain in my application. It might begin or end with white space or newlines so I can't just strip them when reading the XML or actually setting the value in the resulting object.

Sample program to demonstrate the issue:

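import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataIndentDemo // class name is arbitrary, added only so the sample compiles
{
    public static void main(String[] args) throws TransformerException, ParserConfigurationException, IOException, SAXException
    {
        String data = "data";

        StreamSource source = new StreamSource(new StringReader("<foo><bar><![CDATA[" + data + "]]></bar></foo>"));
        StreamResult result = new StreamResult(new StringWriter());

        // Identity transform with pretty-printing switched on
        Transformer tform = TransformerFactory.newInstance().newTransformer();
        tform.setOutputProperty(OutputKeys.INDENT, "yes");
        tform.transform(source, result);

        String xml = result.getWriter().toString();

        System.out.println(xml); // I expect bar and CDATA to be on same line. This is true for Java 8, false for Java 11

        // Parse the pretty-printed XML again and read back the content of <bar>
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        String resultData = document.getElementsByTagName("bar")
            .item(0)
            .getTextContent();

        System.out.println(data.equals(resultData)); // True for Java 8, false for Java 11
    }
}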

EDIT: For future reference, I've submitted a bug report to Oracle, and this is fixed in Java 14: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8223291

Elijah answered 25/4, 2019 at 15:51 Comment(1)

You should edit your question and add a sample Java code that demonstrates the problem (generate a small XML + transform). It is a lot easier to start with a working example. – Connect 26/4, 2019 at 18:1

As your code relies on unspecified behavior, extra explicit code seems better:

You want indentation like:

  tform.setOutputProperty(OutputKeys.INDENT, "yes");
  tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

However, not for elements containing a CDATA section:

  String xml = result.getWriter().toString();
  // No indentation (whitespace) for elements with a CDATA section.
  xml = xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");

The regex uses:

(?s) enables DOTALL mode, so that . matches any character, including newline characters.
.*? matches the shortest possible sequence, so that it does not run across "...]]>...]]>".
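To see what the replacement does, here is a minimal sketch that applies the same regex to a hand-written string resembling the Java 11 output. The class and method names (CdataUnindent, unindentCdata) are made up for illustration, and the sketch only covers the simple case where an element contains nothing but a single CDATA section.

```java
public class CdataUnindent {

    // Hypothetical helper; the regex itself is the one shown above.
    static String unindentCdata(String xml) {
        // (?s): DOTALL, so the CDATA content may span several lines.
        // Group 1 captures the whole CDATA section; the whitespace around it is dropped.
        return xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
    }

    public static void main(String[] args) {
        // Hand-written stand-in for what the indenting Transformer produces on Java 11.
        String indented = "<foo>\n    <bar>\n        <![CDATA[line 1\nline 2]]>\n    </bar>\n</foo>";
        System.out.println(unindentCdata(indented));
    }
}
```

The indentation of elements without CDATA (here: bar inside foo) is left alone; only the whitespace between the enclosing tags and the CDATA section is removed.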

Alternatively: In a DOM tree (preserving CDATA) you can retrieve all CDATA sections per XPath, and remove whitespace siblings using the parent element.
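A rough sketch of that DOM variant, under a few assumptions: it walks the tree recursively instead of using XPath, it relies on the default (non-coalescing) DocumentBuilderFactory keeping CDATA sections as separate nodes, and it treats every whitespace-only text node next to a CDATA section as indentation. All class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataWhitespaceStripper {

    // Removes whitespace-only text nodes from every element that has a CDATA child.
    static void stripAroundCdata(Node node) {
        boolean hasCdata = false;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.CDATA_SECTION_NODE) {
                hasCdata = true;
            }
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (hasCdata && child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripAroundCdata(child);
            }
            child = next;
        }
    }

    // Hypothetical convenience method: parse, clean up, return the text of the first matching element.
    static String cdataText(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            stripAroundCdata(doc.getDocumentElement());
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String indented = "<foo>\n    <bar>\n        <![CDATA[data]]>\n    </bar>\n</foo>";
        System.out.println(cdataText(indented, "bar"));
    }
}
```

Whitespace inside the CDATA section itself is untouched, since only separate text nodes are removed. If an element can legitimately hold significant whitespace-only text next to a CDATA section, this sketch would drop it too.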

Aero answered 29/4, 2019 at 8:48 Comment(6)

Thanks! That's actually a pretty clean workaround. I am wondering what you mean by my code relying on unspecified behavior? – Elijah 29/4, 2019 at 13:40

You are telling the transformation to pretty-print, that is, to indent every element. The newest Java version does exactly that, indenting CDATA sections as well. So it looks as if earlier versions made an exception for CDATA. In any case, one cannot fault this behavior on the basis of the specification. – Aero 29/4, 2019 at 13:44

Well, CDATA can be followed by 'normal' data. For example, this is valid: <test><![CDATA[data]]>foo</test>. By adding additional whitespace, the contents of the XML change. So I do think this is an issue with the Transformer. – Elijah 29/4, 2019 at 13:55

Then why INDENT=yes? One can restrict the allowed content in a DTD/XSD, but I do not think that plays a role here (or validation in general). Would INDENT="no" not suffice, if you are reading the result into a DOM afterwards? – Aero 29/4, 2019 at 14:3

The issue with CDATA has been fixed in Java 14. I tested it with the EA version: openjdk version "14-ea" 2020-03-17 OpenJDK Runtime Environment (build 14-ea+6-171) – Noodlehead 28/7, 2019 at 4:38

Verified that it indeed works with the ea version of OpenJDK 14. Thanks! – Elijah 29/7, 2019 at 14:4

The solution from Joop Eggen is brilliant.

I just want to expand the solution a little bit.

xml = xml.replaceAll(">\\s*(<\\!\\[CDATA\\[(.|\\n|\\r\\n)*?]\\]>)\\s*</", ">$1</");

This regex allows for newlines inside the CDATA section, so it tests for \n and also for Windows-style \r\n.

XML Example:

<test>
   <![CDATA[com.foo.test]]>
</test>
<test>
 <![CDATA[1st Line
2nd Line]]>
</test>

Spiers answered 25/2, 2023 at 22:51 Comment(2)

Joop Eggen mentions prefixing the regex with (?s) to make .* match newlines. While he did not actually include it in the regex in his answer, I think I used it to solve my problem at the time. – Elijah 26/2, 2023 at 21:54

I have edited Joop Eggen's answer to include the (?s) in the regex, I'll leave it up to future readers to decide which regex they prefer to use :) – Elijah 27/2, 2023 at 8:24
