How to preserve newlines in CDATA when generating XML?

I want to write some text that contains whitespace characters such as newlines and tabs into an XML file, so I use

Element element = xmldoc.createElement("TestElement");
element.appendChild(xmldoc.createCDATASection(somestring));

but when I read this back in using

Node vs = xmldoc.getElementsByTagName("TestElement").item(0);
String x = vs.getFirstChild().getNodeValue();

I get a string that no longer has any newlines.
When I look directly at the XML on disk, the newlines seem to be preserved, so the problem occurs when reading the XML file back in.

How can I preserve the newlines?

Thanks!

Warren answered 1/8, 2009 at 15:52 Comment(7)
Could you post a more complete code example? - Daimon
It is an Element. I will post more code soon. - Warren
When you get the value of 'x', is it equivalent to 'somestring' minus the newlines? - Worrywart
Have you tried escaping the backslash on your \n to make it \\n? - Worrywart
Well, when I look directly at the XML on disk, the newlines seem preserved, so the problem occurs when reading the XML back in. Sorry I didn't mention this earlier; I will add it to my post. - Warren
What newline character is being used? A shot in the dark, but I wonder if it has something to do with how line ends are handled: w3.org/TR/REC-xml/#sec-line-ends - Lindstrom
@McDowll, how can I find out which newline character is used? I have the XML file on disk, where the newlines look fine. - Warren
B
5

I don't know how you parse and write your document, but here's an enhanced code example based on yours:

// imports needed by the snippet below
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// creating the document in-memory
Document xmldoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

Element element = xmldoc.createElement("TestElement");
xmldoc.appendChild(element);
element.appendChild(xmldoc.createCDATASection("first line\nsecond line\n"));

// serializing the xml to a string
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();

DOMImplementationLS impl =
    (DOMImplementationLS) registry.getDOMImplementation("LS");

LSSerializer writer = impl.createLSSerializer();
String str = writer.writeToString(xmldoc);

// printing the xml for verification of whitespace in cdata
System.out.println("--- XML ---");
System.out.println(str);

// de-serializing the xml from the string; the charset must match the
// encoding named in the XML declaration (UTF-16 for writeToString)
final Charset charset = Charset.forName("utf-16");
final ByteArrayInputStream input = new ByteArrayInputStream(str.getBytes(charset));
Document xmldoc2 = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input);

// the first child of the element is the CDATA section itself
Node vs = xmldoc2.getElementsByTagName("TestElement").item(0);
final Node child = vs.getFirstChild();
String x = child.getNodeValue();

// print the value, yay!
System.out.println("--- Node Text ---");
System.out.println(x);

Serialization with LSSerializer is the W3C way to do it (it is part of DOM Level 3 Load and Save). The output is as expected, with line separators:

--- XML ---
<?xml version="1.0" encoding="UTF-16"?>
<TestElement><![CDATA[first line
second line
]]></TestElement>
--- Node Text ---
first line
second line
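If the XML is already available as a String, one way to sidestep the byte encoding question altogether is to hand the parser a character stream. The following is only a sketch: the class name CdataRoundTrip and the helper firstChildValue are made up for illustration, and it assumes the CDATA section is the first child of TestElement, as in the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
final class CdataRoundTrip {

    /** Parses XML held in a String and returns the value of TestElement's first child. */
    static String firstChildValue(String xml) {
        try {
            // A Reader delivers characters, so no Charset has to be chosen here.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node element = doc.getElementsByTagName("TestElement").item(0);
            return element.getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<TestElement><![CDATA[first line\nsecond line\n]]></TestElement>";
        System.out.print(firstChildValue(xml));
    }
}
```

When the XML lives in a file instead, the bytes have to be decoded with the encoding named in the XML declaration, which is the point made about Charset.forName() in the comments below.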
Bluefield answered 8/8, 2009 at 11:43 Comment(5)
Thank you, I tried that, but it doesn't work for me. While I can see that the line breaks are there in the XML file on disk, once I read them back with this code, they are gone. Maybe my line break character is bad. How can I find out which one it is? - Warren
The output I showed is real output from my own machine for the code example I posted. Did you try writing the text with the code I suggested, or only reading it with my code? Also, what is the encoding of your file? (You can see that in my example the encoding is UTF-16.) I had a similar problem caused by not using the same encoding, and I fixed it by calling Charset.forName() with the encoding actually used. - Bluefield
Yes, I did try your actual code in my case. I used exactly the same code to output the string, but it does not contain the whitespace. The encoding I use is encoding="ISO-8859-1". I will try UTF-16. - Warren
If you use exactly the same code with ISO-8859-1, you will have problems unless you change Charset.forName to use ISO-8859-1. Newlines can be problematic between ASCII and UTF-16, so it's worth a shot. - Bluefield
I have tried "utf-16" while de-serializing, but in my case it converts "\r\n" to "\n\n", which causes problems for me. Can you please provide a solution to this problem? - Clavicle
H
2

You need to check the type of each node using node.getNodeType(). If the type is CDATA_SECTION_NODE, you need to concatenate the CDATA guards (<![CDATA[ and ]]>) onto node.getNodeValue().
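A minimal sketch of that check, under the assumption that the goal is to rebuild the element's content with the CDATA delimiters put back around each CDATA section. The class name CdataGuards and the method contentWithGuards are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
final class CdataGuards {

    /** Returns the content of the first tagName element, re-adding the CDATA guards. */
    static String contentWithGuards(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getElementsByTagName(tagName).item(0).getChildNodes();
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                    // getNodeValue() holds only the inner text, so the guards are added back.
                    out.append("<![CDATA[").append(child.getNodeValue()).append("]]>");
                } else if (child.getNodeType() == Node.TEXT_NODE) {
                    // Plain text is appended as parsed (entities already resolved).
                    out.append(child.getNodeValue());
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(contentWithGuards("<TestElement>\n<![CDATA[a\nb]]></TestElement>", "TestElement"));
    }
}
```

Looping over all children also covers the case where the CDATA section is not the first child, for example when a whitespace-only text node precedes it.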

Halfassed answered 1/8, 2009 at 16:16 Comment(1)
Yes, the type of the node is CDATA. But what do you mean by concatenating the CDATA guards? - Warren
E
2

You don't necessarily have to use CDATA to preserve whitespace characters. The XML specification specifies how to encode these characters.

So, for example, if you have an element whose value contains a newline, you should encode it with

  &#xA;

Carriage return:

 &#xD;

And so forth
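To illustrate how a parser treats these character references, here is a small sketch (the class name CharRefDemo and the method parsedText are placeholders for this example). Character references are resolved after line-end normalization, so a carriage return written as &#xD; reaches the application as a real \r.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original answer.
final class CharRefDemo {

    /** Parses the XML and returns the text content of the document element. */
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // &#xD; and &#xA; are the character references for CR and LF.
        String text = parsedText("<TestElement>first line&#xD;&#xA;second line</TestElement>");
        // Make the control characters visible when printing.
        System.out.println(text.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The trade-off is readability of the file: the line breaks are no longer visible as line breaks in the XML on disk.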

Emmet answered 1/8, 2009 at 16:48 Comment(1)
Thanks, but is there a way to do this without encoding it, so that I can see the formatted text in the XML file itself? - Warren
L
0

EDIT: cut all the irrelevant stuff

I'm curious to know what DOM implementation you're using, because it doesn't mirror the default behaviour of the one in a couple of JVMs I've tried (they ship with a Xerces impl). I'm also interested in what newline characters your document has.
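On the question of which newline characters matter: the XML specification (section 2.11, the w3.org/TR/REC-xml/#sec-line-ends link from the comments on the question) requires a parser to turn every literal \r\n pair and every lone \r into a single \n before handing text to the application, and this applies inside CDATA sections too. The sketch below demonstrates that with the JDK's default parser; the class name LineEndDemo and the method parsedText are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original answer.
final class LineEndDemo {

    /** Parses the XML and returns the text content of the document element. */
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Input contains a Windows line end (\r\n) and an old Mac line end (\r).
        String text = parsedText("<TestElement><![CDATA[a\r\nb\rc]]></TestElement>");
        // Make the control characters visible when printing.
        System.out.println(text.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

So a string that went into the file with \r\n comes back with \n only; code that compares against or splits on \r\n will then appear to have "lost" its line breaks.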

I'm not sure whether it is a given that CDATA preserves whitespace. I suspect that there are many factors involved. Don't DTDs/schemas affect how whitespace is processed?

You could try using the xml:space="preserve" attribute.

Lindstrom answered 1/8, 2009 at 16:15 Comment(1)
Thanks, where exactly should I add that xml:space="preserve" attribute? To the node that contains the text or to the XML root? - Warren
M
0

xml:space='preserve' is not the answer. It only matters for "all whitespace" nodes, that is, if you want to keep the whitespace nodes in

<this xml:space='preserve'> <has/>
<whitespace/>
</this>

But note that those whitespace nodes contain ONLY whitespace.

I have been struggling to get Xerces to generate events allowing isolation of CDATA content as well. I have no solution as yet.
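One approach that may help with isolating CDATA content is the SAX LexicalHandler interface, whose startCDATA() and endCDATA() callbacks bracket the characters() events of a CDATA section. This is a sketch only, tried against the idea rather than against the poster's setup; the class name CdataOnly and the method cdataContent are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class, not part of the original answer.
final class CdataOnly {

    /** Collects only the characters that appear inside CDATA sections. */
    static String cdataContent(String xml) {
        final StringBuilder cdata = new StringBuilder();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean inCdata;

            @Override
            public void startCDATA() {
                inCdata = true;
            }

            @Override
            public void endCDATA() {
                inCdata = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inCdata) {
                    cdata.append(ch, start, length);
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Lexical events are only delivered if the handler is registered under this property.
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return cdata.toString();
    }

    public static void main(String[] args) {
        System.out.println(cdataContent("<a>before<![CDATA[x\ny]]>after</a>"));
    }
}
```

Because characters() may be called several times for one section, the text is accumulated in a StringBuilder rather than taken from a single callback.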

Merodach answered 13/12, 2014 at 6:36 Comment(0)
