Transform XML with XSLT and preserve CDATA (in Ruby)

Asked 1/10, 2009 at 5:42 Answered 30/8, 2018 at 22:58

I am trying to convert a document with content like the following into another document, leaving the CDATA exactly as it was in the first document, but I haven't figured out how to preserve the CDATA with XSLT.

Initial XML:

<node>
    <subNode>
        <![CDATA[ HI THERE ]]>
    </subNode>
    <subNode>
        <![CDATA[ SOME TEXT ]]>
    </subNode>
</node>

Final XML:

<newDoc>
    <data>
        <text>
            <![CDATA[ HI THERE ]]>
        </text>
        <text>
            <![CDATA[ SOME TEXT ]]>
        </text>
    </data>
</newDoc>

I've tried something like this, but no luck; everything gets jumbled:

<xsl:element name="subNode">
    <xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:element>

Any ideas how to preserve the CDATA?

Thanks! Lance

Using ruby/nokogiri

Update: Here's something that works.

<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:value-of select="normalize-space(text())" disable-output-escaping="yes"/>
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>

That wraps the element's text in a CDATA section, which works for what I need, and it preserves HTML tags inside the text.

Vasectomy answered 1/10, 2009 at 5:42 Comment(0)

You cannot preserve the precise sequence of CDATA nodes if they're mixed with plain text nodes. At best, you can force all content of a particular element in the output to be CDATA, by listing that element name in xsl:output/@cdata-section-elements:

<xsl:output cdata-section-elements="text"/>

Viridis answered 1/10, 2009 at 6:36 Comment(3)

Should I just use ruby and maybe regular expressions to preprocess them before I do the xslt, or something along those lines? How else would you do that? The cdata-section-elements isn't quite cutting it because I'm using variables and such. Thanks for the tip. – Vasectomy 1/10, 2009 at 7:17

If you absolutely need CDATA, then you'll have to look for something other than XSLT. That said, I'm very curious as to the reason why you need it. XDM doesn't distinguish between text and CDATA for a very good reason - no sane XML-processing application should ever give different semantics for them, so CDATA and character-escaping should be useable interchangeably. – Viridis 1/10, 2009 at 17:24

I am using this data in Flash, and I have heard there are lots of problems with CDATA/no CDATA. I haven't really tried yet, though :p – Vasectomy 5/10, 2009 at 8:23

Sorry to post an answer to my own question, but I found something that works:

<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:value-of select="normalize-space(text())" disable-output-escaping="yes"/>
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>

That wraps the element's text in a CDATA section, which works for what I need, and it preserves HTML tags inside the text.

Vasectomy answered 5/10, 2009 at 8:28 Comment(1)

I guess it's a way to get a CDATA section specifically in the output (except that the input text() can itself contain ]]>, in which case it won't quite do what you expect), but I don't see how this would let you preserve CDATA nodes that were there in the first place, since you still have no way of distinguishing input text nodes from input CDATA nodes. Otherwise, I don't see how this is any different from cdata-section-elements... – Viridis 5/10, 2009 at 15:24

I found this article while trying to solve a similar problem (using an XSL transform to take one XML file and create a partial/subset copy of some of the nodes in it, as a second XML file). In my case the first XML files have some elements whose values are entirely wrapped in CDATA blocks, because they happen to be JSON and they carry some HTML formatting markup.

What I found was that rather than using xsl:value-of, I could use xsl:copy-of, and just as @Pavel Minaev points out, I could keep the original CDATA intact by listing every relevant element name in the xsl:output declaration. This might be an approach that would work for the OP.

XML to be copied (sample):

<text_item>
  <id>100</id>
  <stem_text><![CDATA[(any string of text, including HTML)]]></stem_text>
  <answerOptions><![CDATA[{"choices":[{"label":"Atmospheric O<sub>2</sub>",
   "value":"A"},{"label":"Released CO<sub>2</sub>",
   "value":"B"}]}]]></answerOptions>
 ...
</text_item>

Relevant stylesheet lines:

<!-- Emit the text of these elements as CDATA sections in the output -->
<xsl:output method="xml" indent="yes" cdata-section-elements="stem_text answerOptions"/>
...
<xsl:apply-templates select="//text_item"/>
...
<!-- Rebuild each text_item and deep-copy its children unchanged -->
<xsl:template match="text_item">
    <xsl:element name="text_item">
        <xsl:copy-of select="node()"/>
    </xsl:element>
</xsl:template>

The cdata-section-elements attribute tells the serializer to write the text content of the listed elements as CDATA sections, so the blocks that were CDATA in the source XML come out as CDATA again in the output XML file when the transform runs. The attribute takes a whitespace-separated list of element names, so you can name as many elements as you need.

In the OP's example, I believe he would select on //node/subNode and then build an element named text, inside newDoc/data of course. His cdata-section-elements attribute would be simply ="text", exactly as Pavel has it.
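
For the OP's sample, a minimal sketch of that idea could look like the stylesheet below. It assumes XSLT 1.0 (which is what libxslt, the library behind Nokogiri, implements) and the element names from the question. The template layout is an assumption about a reasonable structure, not something taken from the OP's actual stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the text content of every <text> element as a CDATA section -->
  <xsl:output method="xml" indent="yes" cdata-section-elements="text"/>

  <!-- Build the new document skeleton around the converted subNodes -->
  <xsl:template match="/node">
    <newDoc>
      <data>
        <xsl:apply-templates select="subNode"/>
      </data>
    </newDoc>
  </xsl:template>

  <!-- Each subNode becomes a <text> element holding the same string value -->
  <xsl:template match="subNode">
    <text>
      <xsl:value-of select="."/>
    </text>
  </xsl:template>

</xsl:stylesheet>
```

Note that xsl:value-of copies the whole string value of subNode, including the whitespace around the original CDATA block, so that whitespace ends up inside the CDATA section in the output. Using normalize-space(.) instead, as in the OP's own workaround, trims it. Whether the input text was originally CDATA or ordinary escaped text makes no difference to the result.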

Ditmore answered 30/8, 2018 at 22:58 Comment(0)
