Using XmlSlurper: How to select sub-elements while iterating over a GPathResult

I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.

Here's the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

I would expect the each to let me select each 'li' in turn so I can retrieve the corresponding href and address details. Instead, I am getting this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.

Can you let me know:

  • Why I'm getting the output shown
  • How I can retrieve the href/address pairs for each 'li' item

Thanks.

Elyse answered 4/11, 2009 at 17:51

Replace grep with find:

html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

then you'll get

#href1: Here is the addressTelephone number: telephone

#href2: Here is another addressAnother telephone: 0845 1111111

grep returns an ArrayList, but find returns a single NodeChild:

println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()

results in:

class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild

So if you want to keep using grep, you can nest another each to make it work:

html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}

Long story short, in your case, use find rather than grep.
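
As a minimal, self-contained sketch of the find approach, the pairs can also be collected into a list of maps instead of printed directly. This version makes two assumptions that are not part of the question: it parses a trimmed-down, well-formed copy of the markup with a plain XmlSlurper (no TagSoup), and the names pairs, link and address are only illustrative.

```groovy
// Groovy 3+ import; on Groovy 2.x remove this line (groovy.util.XmlSlurper is imported by default).
import groovy.xml.XmlSlurper

// Trimmed-down copy of the markup from the question (assumption: already well-formed).
def htmlText = '''<html><body>
<div id="divId" class="divclass">
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a></h3><address>Here is the address</address></li>
<li><h3><a class="box" href="#href2">href2 link text</a></h3><address>Here is another address</address></li>
</ol>
</div>
</body></html>'''

def html = new XmlSlurper().parseText(htmlText)

// find returns a single node, so .ol.li is a GPathResult over the two li elements.
def div = html.'**'.find { it.@class == 'divclass' }

// Build one map per li element.
def pairs = div.ol.li.collect { li ->
    [link: li.h3.a.@href.text(), address: li.address.text()]
}

pairs.each { println "${it.link}: ${it.address}" }
```

Collecting into maps keeps each href next to its own address, which makes the result easy to pass on to other code.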

Interpellant answered 5/11, 2009 at 5:48

This is a tricky one. When there is just one element with class='divclass', the previous answer is fine. If grep could return multiple results, though, a find() for a single result is not the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter div, and from there the drill-down continues with the expected result.

html."**".grep { it.@class == 'divclass' }.each { div -> div.ol.li.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address.text()
   println "$link: $address\n"
}}

The behavior of the original code deserves a bit more explanation as well. When a property is accessed on a List in Groovy, you get a new list of the same size containing that property of each element. The list returned by grep() has just one entry, so accessing ol gives a list with one entry, which is fine. Accessing li then gives the result of ol.li for that entry: again a list of size() == 1, but this time its single entry has size() == 2. We could apply the outer loop there and get the same result, if we wanted to:

html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address
   println "$link: $address\n"
}}

On any GPathResult representing multiple nodes, we get the concatenation of the text of all nodes. That explains the original output: first the concatenated @href values, then the concatenated address text.
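
The list behavior described above can be checked with a few assertions. This is only a sketch under stated assumptions: it uses a plain XmlSlurper on a small well-formed document rather than TagSoup output, and the variable names are made up for illustration.

```groovy
// Groovy 3+ import; on Groovy 2.x remove this line (groovy.util.XmlSlurper is imported by default).
import groovy.xml.XmlSlurper

// Property access on a plain List yields one value per element.
assert [[name: 'a'], [name: 'b']].name == ['a', 'b']

def xml = '''<html><body><div class="divclass"><ol>
<li><h3><a href="#href1">one</a></h3></li>
<li><h3><a href="#href2">two</a></h3></li>
</ol></div></body></html>'''
def html = new XmlSlurper().parseText(xml)

// grep gives a List with one entry (the div), so .ol.li is also a List with one entry.
def viaGrep = html.'**'.grep { it.@class == 'divclass' }.ol.li
assert viaGrep instanceof List
assert viaGrep.size() == 1

// That single entry is a GPathResult covering both li elements.
assert viaGrep[0].size() == 2

// Text of a multi-node GPathResult is the concatenation over all its nodes.
assert viaGrep[0].h3.a.@href.text() == '#href1#href2'

// find gives a single node, so .ol.li iterates over the two li elements directly.
def viaFind = html.'**'.find { it.@class == 'divclass' }.ol.li
assert viaFind.size() == 2
```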

Licentiate answered 25/4, 2013 at 16:12

I believe the previous answers were all correct at the time of writing, for the versions used. But I am using HTTPBuilder 0.7.1 and Grails 2.4.4 with Groovy 2.3.7, and there is a big issue: HTML element names are transformed to uppercase. This appears to be due to NekoHTML, which is used under the hood:

http://nekohtml.sourceforge.net/faq.html#uppercase

Because of this, the solution in the accepted answer must be written as:

html.'**'.find { it.@class == 'divclass' }.OL.LI.each { linkItem ->
    def link = linkItem.H3.A.@href
    def address = linkItem.ADDRESS.text()
    println "$link: $address\n"
}

This was very frustrating to debug; I hope it helps someone.
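
If the same code has to work whether the parser upper-cases element names or not, one possible workaround is to match names case-insensitively. The sketch below is an assumption-laden illustration, not part of HTTPBuilder or NekoHTML: byName is a hypothetical helper, the upper-cased XML is a hand-written stand-in for the parser output, and a plain XmlSlurper is used. Note that byName searches all descendants, not just direct children, so it is looser than the .OL.LI path above.

```groovy
// Groovy 3+ import; on Groovy 2.x remove this line (groovy.util.XmlSlurper is imported by default).
import groovy.xml.XmlSlurper

// Hand-written stand-in for parser output with upper-cased element names (assumption).
def xml = '''<HTML><BODY><DIV class="divclass"><OL>
<LI><H3><A href="#href1">one</A></H3><ADDRESS>first address</ADDRESS></LI>
<LI><H3><A href="#href2">two</A></H3><ADDRESS>second address</ADDRESS></LI>
</OL></DIV></BODY></HTML>'''

// Hypothetical helper: all nodes at or below `node` whose name matches, ignoring case.
def byName = { node, String name ->
    node.'**'.findAll { it.name().equalsIgnoreCase(name) }
}

def doc = new XmlSlurper().parseText(xml)
def div = doc.'**'.find { it.@class == 'divclass' }

// The lookups use lower-case names but still match the upper-cased elements.
def pairs = byName(div, 'li').collect { li ->
    [link: byName(li, 'a')[0].@href.text(), address: byName(li, 'address')[0].text()]
}

pairs.each { println "${it.link}: ${it.address}" }
```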

Overmatch answered 2/3, 2015 at 0:36
