How do I write unescaped XML outside of a CDATA

I am trying to write XML data using StAX, where the content itself is HTML.

If I try

xtw.writeStartElement("contents");
xtw.writeCharacters("<b>here</b>");
xtw.writeEndElement();

I get this

<contents>&lt;b&gt;here&lt;/b&gt;</contents>

Then I noticed the CDATA method and changed my code to:

xtw.writeStartElement("contents");
xtw.writeCData("<b>here</b>");
xtw.writeEndElement();

and this time the result is

<contents><![CDATA[<b>here</b>]]></contents>

which is still not good. What I really want is

<contents><b>here</b></contents>

So is there an XML API/library that allows me to write raw text without it being in a CDATA section? So far I have looked at StAX and JDOM and they do not seem to offer this.

In the end I might resort to good old StringBuilder but this would not be elegant.

Update:

I agree mostly with the answers so far. However instead of <b>here</b> I could have a 1MB HTML document that I want to embed in a bigger XML document. What you suggest means that I have to parse this HTML document in order to understand its structure. I would like to avoid this if possible.

Answer:

It is not possible, otherwise you could create invalid XML documents.

Wife answered 8/6, 2010 at 10:15 Comment(2)
If this were possible then you could far too easily write invalid XML files. Note that most real-world HTML (that is not XHTML) is not valid XML (far too many unclosed tags and unescaped attributes). All of that is fine for HTML, but not allowed for XML, so using CDATA is really the only correct thing to do, unless your HTML is actually XHTML. - Penetralia
@Joachim: Yes, in my case it is XHTML. That is why I know it is valid and I want to embed it straight away without any processing. - Wife
M
3

The issue is that this is not raw text; it is an element, so you should be writing:

xtw.writeStartElement("contents");
xtw.writeStartElement("b");
xtw.writeCharacters("here"); // plain text content; writeCData would wrap it in a CDATA section again
xtw.writeEndElement();
xtw.writeEndElement();
Molybdenite answered 8/6, 2010 at 10:20 Comment(1)
I think the problem is that he has a blob which MAY contain tags. - Dither
B
1

If you want the XML to be included AS XML and not as character data, then it has to be parsed at some point. If you don't want to manually do the parsing yourself, you have two alternatives:

(1) Use external parsed entities -- in this case the external file will be pulled in and parsed by the XML parser. When the output is again serialized, it will include the contents of the external file.

[ See http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238 ]

(2) Use XInclude -- in that case the file has to be run through an XInclude processor, which will merge the XInclude references into the output. Most XSLT processors, as well as xmllint, will also do XInclude with an appropriate option.

[ See: http://www.xml.com/pub/a/2002/07/31/xinclude.html ]

( XSLT can also be used to merge documents without using the XInclude syntax. XInclude just provides a standard syntax )
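
As a rough illustration of option (2), the JDK's built-in DOM parser can resolve XInclude references while parsing. This is only a sketch: the class name, the temporary directory, and the file names `outer.xml` and `body.xhtml` are made up for the example, and the included file must itself be well-formed XML.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XIncludeDemo {
    /** Writes two small files to a temp directory and parses the outer one with XInclude enabled. */
    static Document parseWithXInclude() throws Exception {
        Path dir = Files.createTempDirectory("xinclude-demo");
        // Hypothetical file names and contents, chosen only for this sketch.
        Files.write(dir.resolve("body.xhtml"), "<b>here</b>".getBytes(StandardCharsets.UTF_8));
        Path outer = dir.resolve("outer.xml");
        Files.write(outer, ("<contents xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                + "<xi:include href=\"body.xhtml\"/></contents>").getBytes(StandardCharsets.UTF_8));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // XInclude elements are recognized by namespace
        dbf.setXIncludeAware(true);  // resolve xi:include while parsing
        // Parsing from a File lets the relative href resolve against the outer file.
        return dbf.newDocumentBuilder().parse(outer.toFile());
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWithXInclude();
        // The included root element now sits where xi:include used to be.
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeName());
    }
}
```

The included file is still parsed, but by the XML parser rather than by your own code, which is the point of this approach.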

Bilbo answered 8/6, 2010 at 17:31 Comment(0)
A
0

The problem is not "here", it's <b></b>.

Add the <b> element as a child of contents and you'll be able to do it. Any library like JDOM or DOM4J will allow you to do this. The general case is to parse the content into an XML DOM and add the root element as a child of <contents>.

You can't add unescaped markup as text outside of a CDATA section.
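
A minimal sketch of that general case, using the JDK's own DOM classes instead of JDOM or DOM4J so that it runs without extra libraries. The class and method names are invented for the example, and it assumes the fragment is well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomEmbed {
    /** Parses the XHTML fragment and attaches its root element under a new contents element. */
    static String embed(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();

        // Parsing throws an exception if the fragment is not well-formed XML.
        Document fragment = builder.parse(new InputSource(new StringReader(xhtml)));

        Document target = builder.newDocument();
        Element contents = target.createElement("contents");
        target.appendChild(contents);

        // A node belongs to the document that created it, so import a deep copy first.
        Node imported = target.importNode(fragment.getDocumentElement(), true);
        contents.appendChild(imported);

        // Serialize with an identity transform, leaving out the XML declaration.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(target), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(embed("<b>here</b>"));
    }
}
```

This holds the whole fragment in memory, so for a very large document a streaming approach may be the better fit.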

Aphotic answered 8/6, 2010 at 10:19 Comment(0)
P
0

If your XML and HTML are not too big, you could make a workaround:

xtw.writeStartElement("contents");
xtw.writeCharacters("anUniqueIdentifierForReplace"); // <--
xtw.writeEndElement();

When you have your XML as a String:

xmlAsString = xmlAsString.replace("anUniqueIdentifierForReplace", yourHtmlAsString); // Strings are immutable, so keep the result

I know, it's not so nice, but this could work.


Edit: Of course, you should check if yourHtmlAsString is valid.

Pydna answered 8/6, 2010 at 10:36 Comment(4)
This is actually a very unclever hack. If you don't want the XML writer to produce a valid XML document, use String concatenation to begin with instead. - Catharine
If you know that you have valid XML to enter as a blob this would work, but you are relying on all of it being well formed. - Molybdenite
Ok! Ok! I won't use this. No need to downvote Daniel any more. - Wife
Any computer program obeys the "Garbage In, Garbage Out" rule. This solution is no worse. Either you have valid input, and then this solution is more efficient than the others, or you don't, in which case all solutions proposed here fail to produce valid XML output. So, this solution is strictly better. - Smack
K
0

If you want to embed a large HTML document in an XML document then CDATA imho is the way to go. That way you don't have to understand or process the internal structure and you can later change the document type from HTML to something else without much hassle. Also I think you can't embed e.g. DOCTYPE instructions directly (i.e. as structured data that retains the semantics of the DOCTYPE instruction). They have to be represented as characters.
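
One caveat with the CDATA route: the text must not contain the sequence `]]>`, because that ends the section early. A possible workaround, sketched below with made-up class and method names, is to split the text into several CDATA sections at each occurrence. It assumes the writer emits CDATA text verbatim, as the JDK's default StAX writer appears to do; other StAX implementations may handle this differently.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class CDataEmbed {
    /** Writes text as one or more CDATA sections, splitting inside every "]]>". */
    static void writeSafeCData(XMLStreamWriter w, String text) throws XMLStreamException {
        int start = 0;
        int idx;
        while ((idx = text.indexOf("]]>", start)) >= 0) {
            // End this section after "]]" so that the ">" opens the next one.
            w.writeCData(text.substring(start, idx + 2));
            start = idx + 2;
        }
        w.writeCData(text.substring(start));
    }

    /** Wraps the given HTML in a contents element as CDATA. */
    static String embed(String html) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xtw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xtw.writeStartElement("contents");
        writeSafeCData(xtw, html);
        xtw.writeEndElement();
        xtw.flush();
        xtw.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(embed("<b>here</b>"));
    }
}
```

A reader of the document sees the adjacent sections as one continuous run of text, so the HTML comes back unchanged.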

(This is primarily a response to your update, but alas I don't have enough rep to comment.)

Kruller answered 8/6, 2010 at 10:44 Comment(0)
R
0

I don't see what the problem is with parsing the large block of XML you want to insert into your output. Use a StAX parser to parse it, and just write code to forward all of the events to your existing serializer (variable "xtw").
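
A minimal sketch of that forwarding loop, with a class name and helper methods made up for the example. It assumes the blob is well-formed XML, and it deliberately stays simple: it copies only elements, attributes, and text, and drops namespaces, comments, and processing instructions, which a real XHTML document would probably need.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class StaxForward {
    /** Replays the reader's element, attribute, and text events into an open writer. */
    static void copy(XMLStreamReader in, XMLStreamWriter out) throws XMLStreamException {
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    out.writeStartElement(in.getLocalName());
                    for (int i = 0; i < in.getAttributeCount(); i++) {
                        out.writeAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    out.writeCharacters(in.getText()); // the writer re-escapes as needed
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    out.writeEndElement();
                    break;
                default:
                    break; // document start/end, comments, etc. are skipped in this sketch
            }
        }
    }

    /** Embeds a well-formed XHTML blob inside a contents element. */
    static String embed(String xhtml) throws XMLStreamException {
        StringWriter result = new StringWriter();
        XMLStreamWriter xtw = XMLOutputFactory.newInstance().createXMLStreamWriter(result);
        XMLStreamReader xtr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xhtml));
        xtw.writeStartElement("contents");
        copy(xtr, xtw);
        xtw.writeEndElement();
        xtw.flush();
        xtr.close();
        xtw.close();
        return result.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(embed("<b>here</b>"));
    }
}
```

Because both sides are streaming, the 1MB document mentioned in the question's update never has to be held in memory as a tree.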

Refrigerant answered 8/6, 2010 at 18:41 Comment(0)
P
0

If the blob of html is actually xhtml then I'd suggest doing something like (in pseudo-code):

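xtw.writeStartElement("contents");
// Parse the XHTML blob into a DOM (pseudo-code; these calls are illustrative, not a real API)
XMLReader xtr = new XMLReader();
xtr.read(blob);
Dom dom = xtr.getDom();
// Write each parsed element into the open <contents> element
for (Element e : dom) {
    xtw.writeElement(e);
}
xtw.writeEndElement();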

or something like that. I had to do something similar once but used a different library.

Phlegethon answered 8/6, 2010 at 20:27 Comment(0)
