XPath along with nokogiri; tutorials/examples? [closed]

Asked 25/10, 2012 at 14:5 Answered 26/10, 2012 at 2:3

I am new to XPath and find it a bit tricky; sometimes it does not work the way I expect it to.

When I scrape data from a website using XPath and Nokogiri, I find it difficult if the website has a complex structure. I use FirePath to get the XPath of an element but sometimes it does not seem to work. I have to remove extra tags added by the browser, like tbody.

I really want to know if there are some good tutorials and examples of XPath and Nokogiri. I could not find much after a Google search.

Ezzell answered 25/10, 2012 at 14:5 Comment(2)

One of the nice things about Nokogiri is it also supports CSS accessors. Sometimes CSS is the faster path to figure out, sometimes XPath is. Feel free to use them interchangeably if necessary. – Hutchins 26/10, 2012 at 1:54

yeah, I use both, css selectors and XPath. Using both together makes it more powerful indeed. – Ezzell 26/10, 2012 at 4:49

The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.

The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.

Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including inserting tbody elements, as you saw. Instead, use Ruby's OpenURI or curl or wget to retrieve the raw source, and look at it with an editor like vi or vim, or use less or cat it to the screen. That way you see the file exactly as the server sent it, with no browser changes.

Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.

Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two different methods: search and at. Both take either a CSS or XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, at_css and at_xpath return the first node that matches the CSS or XPath accessor, or nil if nothing matches. In all those methods, the ...xpath variants accept an XPath, and the ...css ones take a CSS accessor.

Once you have a node, generally you'll want to do one of two things to it: extract an attribute or get its text/content. You can get an attribute using node['attribute_to_get'] and the text using text.

Using those methods we can search for all the links in a page and return their text and related href, using something like:

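require 'awesome_print'  # gem providing `ap` for pretty-printing
require 'nokogiri'
require 'open-uri'

# Kernel#open no longer accepts URLs in Ruby 3.0+; open-uri provides URI.open instead.
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
# Pair each link's href attribute with its text, and print the first five pairs.
ap doc.search('a').map { |a| [a['href'], a.text] }[0, 5]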

At the time of writing, that output the following (the page has likely changed since):

[
    [0] [
        [0] "/",
        [1] ""
    ],
    [1] [
        [0] "/domains/",
        [1] "Domains"
    ],
    [2] [
        [0] "/numbers/",
        [1] "Numbers"
    ],
    [3] [
        [0] "/protocols/",
        [1] "Protocols"
    ],
    [4] [
        [0] "/about/",
        [1] "About IANA"
    ]
]

Hutchins answered 26/10, 2012 at 2:3 Comment(0)

I also found that there was a pretty steep learning curve using Nokogiri and XPath at the beginning, but after a lot of trial and error I've now managed to get the hang of both, so hang in there! Nokogiri is really powerful and well worth learning.

Regarding tutorials/examples, I assume you've seen the Nokogiri tutorials page. I can imagine that the level of those tutorials might be a bit high if you're not used to XPath, XML parsing etc.

Some other possible resources:

On XPath, I'd suggest reading this summary in five paragraphs. At its core XPath is fairly simple, just really unintuitive! I find CSS much easier to remember, and I don't think I'm the only one.

But in the end, while tutorials will help, the best thing you can do is to just crack open a console, require 'nokogiri' and start plugging away. After a while it will just start making sense.

Aurangzeb answered 25/10, 2012 at 14:29 Comment(1)

For the first link, the new address is: engineyard.com/blog/getting-started-with-nokogiri – Cheloid 12/8, 2019 at 18:30
