I'm hoping someone will just point out something obvious that I'm missing here. I feel like I've done this a hundred times and for some reason tonight, the behavior coming from this is throwing me for a loop.
I'm reading in some XML from a public API. I want to extract all the text from a certain node (everything within 'body'), which also includes a variety of child nodes. Simple example:
    <xml>
      <metadata>
        <article>
          <body>
            <sec>
              <title>A Title</title>
              <p>
                This contains
                <italic>italics</italic>
                and
                <xref ref-type="bibr">xref's</xref>
                .
              </p>
            </sec>
            <sec>
              <title>Second Title</title>
            </sec>
          </body>
        </article>
      </metadata>
    </xml>
So ultimately I want to traverse the tree under the desired node (again, 'body') and extract all the text it contains, in document order. Simple enough, so I wrote up this little Groovy script...
    def xmlParser = new XmlParser()
    def xml = xmlParser.parseText(rawXml)
    xml.metadata.article.body[0].depthFirst().each { node ->
        if (node.children().size() == 1) {
            println node.text()
        }
    }
...which promptly blows up with "No signature of method: java.lang.String.children()". So I'm thinking to myself, "Wait, what? Am I going crazy?" Node.depthFirst() should only return a List of Nodes. I added a quick 'instanceof' check and, sure enough, I'm getting a mix of Node objects and String objects. Specifically, the bare text that sits directly inside <p> rather than inside its own child element comes back as Strings, i.e. "This contains" and "and". Everything else is a Node (as expected).
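For anyone who wants to reproduce the check, here is a minimal sketch of that kind of 'instanceof' diagnostic. The inline sampleXml and the names body, items and label are made up for illustration, and the import assumes Groovy 3 or later. Which types actually come back may well depend on the Groovy version in use.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line (groovy.util.XmlParser is imported by default)

// Hypothetical trimmed-down sample with the same mixed content as the <p> element above
def sampleXml = '''<xml><metadata><article><body>
  <sec><title>A Title</title><p>This contains <italic>italics</italic> and <xref ref-type="bibr">xref's</xref>.</p></sec>
</body></article></metadata></xml>'''

def body = new XmlParser().parseText(sampleXml).metadata.article.body[0]

// Print the runtime type of everything depthFirst() hands back
def items = body.depthFirst()
items.each { item ->
    def label = item instanceof Node ? "<${item.name()}>" : "'${item}'"
    println "${item.getClass().simpleName}: ${label}"
}
```

Any line in the output that starts with "String" is one of the items that would trip the children() call in the script above.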
I can work around this easily. However, this doesn't seem like correct behavior and I'm hoping someone can point me in the right direction.
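One possible shape for such a workaround, as a sketch only: skip depthFirst() and recurse over children() directly, treating anything that is not a Node as text. The closure collectText, the inline sampleXml and the explicit trim() are assumptions for illustration (the import again assumes Groovy 3 or later), and this does not explain the depthFirst() behavior itself.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line (groovy.util.XmlParser is imported by default)

// Hypothetical inline copy of the sample document
def sampleXml = '''<xml><metadata><article><body>
  <sec><title>A Title</title><p>This contains <italic>italics</italic> and <xref ref-type="bibr">xref's</xref>.</p></sec>
  <sec><title>Second Title</title></sec>
</body></article></metadata></xml>'''

def body = new XmlParser().parseText(sampleXml).metadata.article.body[0]

// Hypothetical helper: walk children() in document order and gather text pieces
def collectText
collectText = { item ->
    if (item instanceof Node) {
        // Element: descend into its children, which may be Nodes or Strings
        return item.children().collectMany { collectText(it) }
    }
    // Anything else is treated as text; trim it and drop whitespace-only pieces
    def text = item.toString().trim()
    return text ? [text] : []
}

def texts = collectText(body)
texts.each { println it }
```

Because children() holds both elements and bare text for mixed content, this keeps "This contains" and "and" in their original positions between the child elements.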