I needed to accomplish the same task, and I solved it with two XSLT transformations.
Let me stress that this only works if the content of the CDATA section is well-formed XML.
For completeness, let me add a root element to your example XML:
<root>
<well-formed-content><![CDATA[ Some Text <p>more text and tags</p>]]>
</well-formed-content>
</root>
Fig 1.- Starting xml
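To illustrate the precondition, here is a hypothetical input on which the approach would be expected to fail. The element name broken-content and its text are made up for this sketch; they are not part of the original question.

```xml
<root>
  <!-- Hypothetical counter-example: the CDATA content is NOT well-formed XML -->
  <broken-content><![CDATA[ 1 < 2 <br> unclosed tag ]]></broken-content>
</root>
```

Once this text is written out unescaped, the intermediate file contains a bare less-than sign and a br element that is never closed, so an XML parser rejects it before a second transformation can run.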
First step
In the first transformation step, I wrap every text node in a newly introduced XML element, old_text:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0"
              encoding="UTF-8" standalone="yes" />
  <!-- Element nodes: copy the element and process its attributes and children -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
    </xsl:copy>
  </xsl:template>
  <!-- Attribute, comment and processing-instruction nodes: pass through unmodified -->
  <xsl:template match="@*|comment()|processing-instruction()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- Text nodes: wrap them in a new element without escaping them. -->
  <!-- (Precondition: the CDATA content must be well-formed XML.) -->
  <xsl:template match="text()">
    <xsl:element name="old_text">
      <xsl:value-of select="." disable-output-escaping="yes" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
Fig 2.- First xslt (wrapping CDATA in "old_text" elements)
If you apply this transformation to the starting xml this is what you get (I'm not reformatting it to avoid confusion about who does what):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
</old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>
</old_text></well-formed-content><old_text>
</old_text></root>
Fig 3.- Transformed xml (first step)
Second step
You now need to clean up the introduced old_text elements and re-escape the text that did not create new nodes:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0"
              encoding="UTF-8" standalone="yes" />
  <!-- Element nodes: process the nodes and their children -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|text()|@*|comment()" />
    </xsl:copy>
  </xsl:template>
  <!-- Attribute nodes and comment nodes: pass through unmodified -->
  <xsl:template match="@*|comment()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!--
    'Wrapper' node: remove the wrapper element but process its children.
    With this matcher, the "old_text" element is removed, but the well-formed
    nodes that were originally inside the CDATA surface in the resulting xml.
  -->
  <xsl:template match="old_text">
    <xsl:apply-templates select="*|text()" />
  </xsl:template>
  <!--
    Text nodes: text here comes from the original CDATA and must now be
    escaped. Note that the previous rule has already extracted all the
    nodes that existed in the CDATA.
  -->
  <xsl:template match="text()">
    <xsl:value-of select="." disable-output-escaping="no" />
  </xsl:template>
</xsl:stylesheet>
Fig 4.- 2nd xslt (cleaned-up artificially-introduced elements)
Result
This is the final result, with the nodes that were originally in the CDATA expanded in your new xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
<well-formed-content> Some Text <p>more text and tags</p>
</well-formed-content>
</root>
Fig 5.- Final xml
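As a side note, the second step could probably also be written more compactly around the standard identity template. The following is only a sketch under that assumption, not the stylesheet used for the results shown here. Unlike Fig 4, it would also keep comments and processing instructions, including any that came out of the CDATA.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0"
              encoding="UTF-8" standalone="yes" />
  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  <!-- Drop the old_text wrapper but keep whatever it contains -->
  <xsl:template match="old_text">
    <xsl:apply-templates select="node()" />
  </xsl:template>
</xsl:stylesheet>
```

The old_text template wins over the identity template because a plain element name has a higher default priority than node(), so no explicit priority attribute should be needed.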
Caveat
If your CDATA contains HTML character entities that XML does not predefine (XML itself only knows lt, gt, amp, apos and quot; see the Wikipedia article about character entities for examples), you need to declare those entities in your intermediate xml. Let me show this with an example:
<root>
<well-formed-content>
<![CDATA[ Some Text <p>more text and tags</p>,
now with a non-breaking-space before the stop:&nbsp;.]]>
</well-formed-content>
</root>
Fig 6.- Added character entity &nbsp; to xml in Fig 1
The original xslt from Fig 2 will convert the xml into this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
</old_text><well-formed-content><old_text>
Some Text <p>more text and tags</p>,
now with a non-breaking-space before the stop:&nbsp;.
</old_text></well-formed-content><old_text>
</old_text></root>
Fig 7.- Result of a first try to convert the xml in Fig 6 (Not well-formed!)
The problem with this file is that it is not well-formed and thus cannot be processed any further by an XSLT processor:
The entity "nbsp" was referenced, but not declared.
XML checking finished.
Fig 8.- Result of the well-formedness checking for the xml in Fig 7
This workaround does the trick (the match="/" template adds a declaration for the &nbsp; entity):
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" version="1.0"
              encoding="UTF-8" standalone="yes" />
  <!-- Add an html entity to the xml character entities declaration. -->
  <!-- (The xsl:text content is kept flush left because its whitespace is written to the output.) -->
  <xsl:template match="/">
    <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root
[
<!ENTITY nbsp "&#160;">
]>
]]>
</xsl:text>
    <xsl:apply-templates select="*" />
  </xsl:template>
  <!-- Element nodes: copy the element and process its attributes and children -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
    </xsl:copy>
  </xsl:template>
  <!-- Attribute, comment and processing-instruction nodes: pass through unmodified -->
  <xsl:template match="@*|comment()|processing-instruction()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- Text nodes: wrap them in a new element without escaping them. -->
  <!-- (Precondition: the CDATA content must be well-formed XML.) -->
  <xsl:template match="text()">
    <xsl:element name="old_text">
      <xsl:value-of select="." disable-output-escaping="yes" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
Fig 9.- The xslt creates the entity declaration
Now, after applying this xslt to the Fig 6 source xml, this is the intermediate xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><!DOCTYPE root
[
<!ENTITY nbsp "&#160;">
]>
<root><old_text>
</old_text><well-formed-content><old_text>
Some Text <p>more text and tags</p>,
now with a non-breaking-space before the stop:&nbsp;.
</old_text></well-formed-content><old_text>
</old_text></root>
Fig 10.- Intermediate xml (xml from Fig 3 plus entity declaration)
You can use the xslt transformation from Fig 4 to produce the final xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
<well-formed-content>
Some Text <p>more text and tags</p>,
now with a non-breaking-space before the stop: .
</well-formed-content>
</root>
Fig 11.- Final xml with html entities converted to UTF-8
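If the CDATA uses more than one HTML entity, each of them needs its own declaration. As a sketch, the match="/" template could be extended as follows. The entities copy and eacute are picked purely for illustration and do not appear in the example above; the numeric values are the Unicode code points of the respective characters.

```xml
<xsl:template match="/">
  <!-- Emit an internal DTD subset that declares every HTML entity used in the CDATA -->
  <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root
[
<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
<!ENTITY eacute "&#233;">
]>
]]></xsl:text>
  <xsl:apply-templates select="*" />
</xsl:template>
```

The name after DOCTYPE has to match the root element of the document being transformed, which is root in these examples.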
Notes
For these examples I used the XSLT processor built into NetBeans 7.1.2 (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, the default JRE XSLT processor).
Disclaimer: I'm not an XML expert. I have the feeling that this should be even easier...