java: remove cdata tag from xml

XPath is nice for parsing XML files, but it's not working for data inside a CDATA section:

<![CDATA[ Some Text <p>more text and tags</p>... ]]>

My solution: get the content of the XML first and remove

"<![CDATA["  and  "]]>".

After that I would run XPath "to reach everything" in the XML file. Is there a better solution? If not, how can I do it with a regular expression?

Rowdyish answered 26/7, 2011 at 21:17 Comment(3)
Removing CDATA may render your XML invalid (and maybe useless for processing purposes). - Syphon
Regex and XML DO NOT MIX. Please read stackoverflow.com/questions/1732348 - Ancier
So what would be the solution to get information like title, description and pubtime, and at the same time the CDATA content, from an RSS XML file? It's actually the image link that I need from the CDATA. - Rowdyish

The reason for the CDATA tags there is that everything inside them is pure text, nothing which should be interpreted directly as XML. You could write your document fragment in the question alternatively as

 Some Text &lt;p&gt;more text and tags&lt;/p&gt;... 

(with a leading and trailing space).
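
A minimal sketch of that equivalence in Java, using only the JDK's built-in DOM and XPath classes. The element names item and description and the class name are made up for the illustration; the point is only that XPath sees the same string value for both spellings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: a CDATA section and escaped text have the same string value.
class CdataEquivalence {

    // Parses the XML string and returns the string value of /item/description.
    static String descriptionOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath()
                .evaluate("/item/description", doc);
    }

    public static void main(String[] args) throws Exception {
        String withCdata = "<item><description>"
                + "<![CDATA[ Some Text <p>more text and tags</p>... ]]>"
                + "</description></item>";
        String escaped = "<item><description>"
                + " Some Text &lt;p&gt;more text and tags&lt;/p&gt;... "
                + "</description></item>";
        // Both documents yield the same text, markup characters included.
        System.out.println(descriptionOf(withCdata).equals(descriptionOf(escaped)));
    }
}
```

So XPath can reach the CDATA content as text; it just cannot step into it as if it were elements.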

If you really want to interpret this as XML, extract the text from your document, and submit it to an XML parser again.
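
A rough sketch of that two-stage parse in Java, assuming the CDATA content is itself well-formed once it is wrapped in a root element. The class name, the fragment wrapper element, the item/description structure and the example.com URL are all hypothetical; real feed content (unclosed tags, HTML entities such as &nbsp;) may make the second parse throw a SAXException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: read CDATA text with XPath, then parse that text again.
class CdataReparse {

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates outerPath on the document, parses the resulting text as an
    // XML fragment, and evaluates innerPath on that fragment.
    static String extract(String xml, String outerPath, String innerPath) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String cdataText = xpath.evaluate(outerPath, parse(xml));
        // Wrap in an artificial root, since the text may hold several top-level nodes.
        Document inner = parse("<fragment>" + cdataText + "</fragment>");
        return xpath.evaluate(innerPath, inner);
    }

    public static void main(String[] args) throws Exception {
        String rss = "<item><title>Example</title><description><![CDATA[ Some Text "
                + "<p>more text and tags</p><img src=\"http://example.com/a.png\"/> ]]>"
                + "</description></item>";
        // The image link is read from the re-parsed fragment, not from the outer document.
        System.out.println(extract(rss, "/item/description", "/fragment/img/@src"));
    }
}
```

The outer document is never modified, so title, description and the other regular elements stay reachable with ordinary XPath.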

Ker answered 26/7, 2011 at 22:18 Comment(2)
I'm curious whether you're suggesting something simpler than my proposed answer? - Backwoodsman
Not really ... I just said that the problem normally shouldn't exist, as the content in the CDATA area is not meant to be interpreted as XML. - Sunday

To strip the CDATA and keep the tags as tags, you could use XSLT.

Given this XML input:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
    <child>Here is some text.</child>
    <child><![CDATA[Here is more text <p>with tags</p>.]]></child>
</root>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Will return the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <child>Here is some text.</child>
   <child>Here is more text <p>with tags</p>.</child>
</root>

(Tested with Saxon HE 9.3.0.5 in oXygen 12.2)

Then you could use XPath to extract the contents of the p element:

/root/child/p
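
For the Java side of the question, here is a sketch of running such a transformation and then the XPath query. It is not the Saxon setup used above: it assumes the JDK's built-in XSLT 1.0 processor, so the stylesheet is embedded as a reduced version="1.0" variant, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: expand CDATA with XSLT, then query the result with XPath.
class XsltThenXPath {

    // Assumption: an XSLT 1.0 reduction of the stylesheet shown above.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='*'><xsl:copy>"
        + "<xsl:apply-templates select='*'/>"
        + "<xsl:value-of select='text()' disable-output-escaping='yes'/>"
        + "</xsl:copy></xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xslt, String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // disable-output-escaping only has an effect when the result is serialized,
        // so the output goes to a stream rather than to a DOM tree.
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // Expands the CDATA content and returns the string value of /root/child/p.
    static String paragraphOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(transform(XSLT, xml))));
        return XPathFactory.newInstance().newXPath().evaluate("/root/child/p", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><child>Here is some text.</child>"
                + "<child><![CDATA[Here is more text <p>with tags</p>.]]></child></root>";
        System.out.println(paragraphOf(xml));
    }
}
```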
Riesling answered 20/8, 2012 at 17:58 Comment(0)

I needed to accomplish the same task. I have solved it with two XSLT transformations.

Just let me stress that this will only work if the CDATA content is well-formed XML.

To be complete, let me add to your example xml a root element:

<root>
   <well-formed-content><![CDATA[ Some Text <p>more text and tags</p>]]>
   </well-formed-content>
</root>

Fig 1.- Starting xml


First step

In the first transformation step, I have wrapped all text nodes in a newly introduced XML element, old_text:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (Note the precondition: the CDATA must be well-formed XML.) -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

Fig 2.- First xslt (wrapping CDATA in "old_text" elements)

If you apply this transformation to the starting xml this is what you get (I'm not reformatting it to avoid confusion about who does what):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>
    </old_text></well-formed-content><old_text>
</old_text></root>

Fig 3.- Transformed xml (first step)


Second step

You now need to clean up the introduced old_text elements and re-escape the text that didn't create new nodes:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <!-- Element-nodes: Process nodes and their children -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!--
        'Wrapper'-node: remove the wrapper element but process its children.
        With this matcher, the "old_text" is cleaned, but the originally CDATA
        well-formed nodes surface in the resulting xml.
    -->
    <xsl:template match="old_text">
        <xsl:apply-templates select="*|text()" />
    </xsl:template>

    <!--
        Text-nodes: Text here comes from original CDATA and must be now
        escaped. Note that the previous rule has extracted all the existing
        nodes in the CDATA. -->
    <xsl:template match="text()">
        <xsl:value-of select="." disable-output-escaping="no" />
    </xsl:template>

</xsl:stylesheet>

Fig 4.- 2nd xslt (cleaned-up artificially-introduced elements)


Result

This is the final result, with the nodes that were originally in CDATA expanded in your new XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content> Some Text <p>more text and tags</p>
    </well-formed-content>
</root>

Fig 5.- Final xml
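
To run both steps from Java, the two transformations can be chained with the JDK's default XSLT processor. This is only a sketch: STEP1 and STEP2 are condensed stand-ins for the stylesheets in Fig 2 and Fig 4 (identity templates instead of the separate copy rules), and the class and method names are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper that chains the wrap step and the clean-up step.
class TwoPassCdataExpand {

    static final String HEAD =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";
    static final String IDENTITY =
          "<xsl:template match='@*|node()'><xsl:copy>"
        + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>";

    // Step 1 (condensed Fig 2): wrap every text node in old_text, unescaped.
    static final String STEP1 = HEAD + IDENTITY
        + "<xsl:template match='text()' priority='1'><old_text>"
        + "<xsl:value-of select='.' disable-output-escaping='yes'/>"
        + "</old_text></xsl:template></xsl:stylesheet>";

    // Step 2 (condensed Fig 4): drop the old_text wrappers, keep their children.
    static final String STEP2 = HEAD + IDENTITY
        + "<xsl:template match='old_text'><xsl:apply-templates/></xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xslt, String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // The intermediate XML is serialized to a string and parsed again by step 2;
    // that round trip is what turns the unescaped text into real element nodes.
    static String expand(String xml) throws Exception {
        return transform(STEP2, transform(STEP1, xml));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expand("<root><well-formed-content>"
                + "<![CDATA[ Some Text <p>more text and tags</p>]]>"
                + "</well-formed-content></root>"));
    }
}
```

If the CDATA content is not well-formed, the second call fails while parsing the intermediate XML.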


Caveat

If your CDATA contains HTML character entities that XML does not predefine (see the Wikipedia article about character entities for examples), you need to declare those entities in your intermediate XML. Let me show this with an example:

<root>
    <well-formed-content>
        <![CDATA[ Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.]]>
    </well-formed-content>
</root>

Fig 6.- Added character entity &nbsp; to xml in Fig 1

The original xslt from Fig 2 will convert the xml into this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

Fig 7.- Result of a first try to convert the xml in Fig 6 (Not well-formed!)

The problem with this file is that it is not well-formed and thus cannot be processed further by an XSLT processor:

The entity "nbsp" was referenced, but not declared.
XML checking finished.

Fig 8.- Result of the well-formedness checking for the xml in Fig 7

This workaround does the trick (the match="/" template adds the &nbsp; entity):

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
                encoding="UTF-8" standalone="yes" />

    <!-- Add an html entity to the xml character entities declaration. -->
    <xsl:template match="/">
        <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>
]]>
        </xsl:text>
        <xsl:apply-templates select="*" />
    </xsl:template>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (Note the precondition: the CDATA must be well-formed XML.) -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet> 

Fig 9.- The xslt creates the entity declaration

Now, after applying this xslt to the Fig 6 source xml, this is the intermediate xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>

        <root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

Fig 10.- Intermediate xml (xml from Fig 3 plus entity declaration)
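
A small Java check of why the declaration matters, using the JDK's default DOM parser. The class and method names are made up, and the documents are shortened stand-ins for Fig 7 and Fig 10.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical check: &nbsp; only parses when the entity is declared.
class NbspEntityCheck {

    // Parses the XML string and returns the text content of the root element.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    static boolean isWellFormed(String xml) {
        try {
            textOf(xml);
            return true;
        } catch (Exception e) {
            // The parser rejects the reference to an undeclared entity.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<root>before the stop:&nbsp;.</root>";
        String doctype = "<!DOCTYPE root [ <!ENTITY nbsp \"&#160;\"> ]>";
        System.out.println(isWellFormed(body));           // entity not declared
        System.out.println(isWellFormed(doctype + body)); // entity declared in the internal subset
    }
}
```

With the declaration in place the parser replaces &nbsp; with the character U+00A0, which is why the final XML no longer contains the entity reference.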

You can use the xslt transformation from Fig 4 to produce the final xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop: .
    </well-formed-content>
</root>

Fig 11.- Final xml with html entities converted to UTF-8


Notes

For these examples I have used NetBeans 7.1.2 built-in XSLT processor (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl - default JRE XSLT processor)

Disclaimer: I'm not an XML expert. I have the feeling that this should be even easier...

Backwoodsman answered 15/6, 2012 at 19:49 Comment(0)

You can remove the CDATA markers from the XML by replacing them in the string.

For example:

String s = "<sn><![CDATA[poctest]]></sn>";
// Remove the complete opening and closing markers so that no stray '<' or '>' is left behind.
s = s.replace("<![CDATA[", "");
s = s.replace("]]>", "");

The result will be:

<sn>poctest</sn>

Note that the result is only well-formed XML if the CDATA content contains no unescaped '<' or '&' characters.

Please check whether this solves your issue.

Agustin answered 22/3, 2016 at 13:55 Comment(0)

Try this:

import java.util.regex.Pattern;

public static String removeCDATA(String text) {
    // Match each CDATA section lazily and keep only its content (group 1).
    // DOTALL lets the content span several lines.
    Pattern regex = Pattern.compile("<!\\[CDATA\\[(.*?)\\]\\]>", Pattern.DOTALL);
    return regex.matcher(text).replaceAll("$1");
}

When I call this method with your test input <![CDATA[ Some Text <p>more text and tags</p>... ]]>, it returns " Some Text <p>more text and tags</p>... " (the content, including its leading and trailing space).

But I think a version without regular expressions will be more reliable. Something like this:

public static String removeCDATA(String s) {
    s = s.trim();
    if (s.startsWith("<![CDATA[")) {
        s = s.substring(9); // 9 is the length of "<![CDATA["
        int i = s.indexOf("]]>");
        if (i == -1) throw new IllegalStateException("argument starts with <![CDATA[ but cannot find pairing ]]>");
        s = s.substring(0, i);
    }
    return s;
}
Leeanneleeboard answered 24/11, 2016 at 12:23 Comment(0)
