Groovy Node.depthFirst() returning a List of Nodes and Strings?

<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains
                        <italic>italics</italic>
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
xml.metadata.article.body[0].depthFirst().each { node ->
    if (node.children().size() == 1) {
        println node.text()
    }
}

I'm pretty sure that's correct behavior (though I've always found XmlSlurper and XmlParser to have screwy APIs). IMO, everything you can iterate over should implement a common node interface, perhaps with a TEXT type that tells you when to read the text from it.

Those text nodes are valid nodes that, in many cases, you'd want to hit during a depth-first traversal of the XML. If they weren't returned, your check for a children size of 1 wouldn't work, because some nodes (like the <p> tag) have a mix of text and elements underneath them.

What makes things worse is that depthFirst() doesn't return text nodes consistently: when the text is a node's only child, as with <italic> above, the string is left out of the result.
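To see which items come back as Nodes and which as Strings, here is a minimal sketch that labels each item in the depthFirst() result. The one-line sample XML and the variable names are mine, not from the question. I set trimWhitespace explicitly so the result doesn't depend on the parser's default, and the import assumes Groovy 3 or later (on 2.x XmlParser lives in groovy.util and needs no import).

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line

def parser = new XmlParser()
parser.trimWhitespace = true  // assumption: trim text so the labels are easy to read

// Hypothetical sample: a <p> with mixed text and one child element
def p = parser.parseText('<p>This contains <italic>italics</italic> and more</p>')

// Label every item in the traversal by its runtime type
def kinds = p.depthFirst().collect { item ->
    (item instanceof Node) ? ('Node:' + item.name()) : ('String:' + item)
}

kinds.each { println it }
```

If your Groovy version behaves as described in this thread, the mixed text under <p> shows up as String entries, while the text inside <italic> does not get its own entry because it is that node's only child.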

I like to use Groovy's method overloading and let the runtime pick the right handler for each node (rather than using something like instanceof), like this:

def rawXml = """<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains 
                        <italic>italics</italic> 
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>"""

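// Text nodes come back from the traversal as plain Strings: pass them through.
def processNode(String nodeText) {
    return nodeText
}

// Everything else is a Node: use its text only when it has exactly one child.
// Otherwise this returns null, which findResults discards.
def processNode(Object node) {
    if (node.children().size() == 1) {
        return node.text()
    }
}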

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
def xmlText = xml.metadata.article.body[0].'**'.findResults { node ->
    processNode(node)
}

println xmlText.join(" ")

Prints:

A Title This contains italics and xref's .  Second Title
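The code above relies on two Groovy behaviors that are easy to miss: a method whose only statement is an if that doesn't fire returns null, and findResults drops null results (but keeps empty strings). A minimal sketch, using a made-up helper unrelated to XML:

```groovy
// Hypothetical helper: returns a value only when the condition holds
def doubleIfLarge(Integer n) {
    if (n > 1) {
        return n * 2
    }
    // no else branch: the method yields null here
}

def kept = [1, 2, 3].findResults { doubleIfLarge(it) }
def all = [1, 2, 3].collect { doubleIfLarge(it) }
def strings = ['a', '', null].findResults { it }

println kept     // [4, 6]
println all      // [null, 4, 6]
println strings  // [a, ]
```

That is why processNode(Object) doesn't need an explicit "skip this node" branch.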

Alternatively, XmlSlurper probably does more of what you want/expect: its text() method returns more reasonable output. If you don't need to do any DOM walking with the results (which is what XmlParser is "better" for), I'd suggest XmlSlurper:

def xmlParser = new XmlSlurper()
def xml = xmlParser.parseText(rawXml)
def bodyText = xml.metadata.article.body[0].text()
println bodyText

Prints:

A Title
                    This contains 
                    italics 
                    and
                    xref's
                    .
                Second Title
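The text() result keeps the indentation from the source XML. If you want it on one line, you can collapse the whitespace yourself. A minimal sketch, assuming Groovy 3 or later for the import (on 2.x XmlSlurper is in groovy.util and needs no import); the shortened sample XML and variable names are mine:

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on 2.x drop this line

// Hypothetical, shortened version of the XML from the question
def sampleXml = '''<body>
    <p>
        This contains
        <italic>italics</italic>
        and more
        .
    </p>
</body>'''

def body = new XmlSlurper().parseText(sampleXml)

// text() concatenates all descendant text, indentation included
def rawText = body.text()

// Collapse every run of whitespace to a single space and trim the ends
def cleanText = rawText.replaceAll(/\s+/, ' ').trim()

println cleanText  // This contains italics and more .
```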
