I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks the most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems that it doesn't even support basic event-based JavaScript code like that in the following web page:
<html>
  <head>
    <title>test</title>
    <script type="text/javascript">
      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);
      document.addEventListener('DOMContentLoaded', function(){
        console.log("DOMContentLoaded triggered");
      }, false);
      function loaded() {
        console.log("onload triggered");
      }
    </script>
  </head>
  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>
I then decided to trigger those events manually like this:
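var zombie = require("zombie");

// Load the page, then fire the events that zombie did not trigger by itself.
zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {
  if (err) { throw err; }
  var doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {
    if (err) { throw err; }
    var body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {
      if (err) { throw err; }
      // Dump the resulting document once both handlers have run.
      console.log(browser.html());
    });
  });
});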
This works for this particular test page. My problem is more general, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). Logging into the site using zombie is no problem, but some content, like those lists, seems to be loaded entirely dynamically using AJAX, and I don't know how to trigger the event handlers that initiate the loading.
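To make clear what I mean by triggering handlers manually, here is a minimal sketch of the mechanism outside of zombie. It assumes Node 15 or later, where `EventTarget` and `Event` are available as globals; `fakeDocument` and `log` are names I made up for illustration, and this says nothing about how zombie implements its DOM internally.

```javascript
// Stand-in for browser.document: Node's built-in EventTarget (Node 15+).
// Illustration only; this is not zombie's DOM implementation.
const fakeDocument = new EventTarget();
const log = [];

// Same registration pattern as in the test page above.
fakeDocument.addEventListener("DOMContentLoaded", function () {
  log.push("DOMContentLoaded triggered");
});

// Registering a handler does not invoke it.
log.push("handlers registered");

// Dispatching the event by hand runs the handler synchronously.
fakeDocument.dispatchEvent(new Event("DOMContentLoaded"));

console.log(log.join("\n"));
```

On a real site the hard part is knowing which events, on which targets, the page's own scripts are waiting for, which is what the questions below are about.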
There are several questions I have regarding this problem:
- Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
- Is there a reference on the loading process of a complex JavaScript-based page?
- Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
- Any other ideas about this topic?
Again, please do not point me to solutions that involve controlling a real browser, like Selenium, as I know about those. What would be welcome, however, are suggestions for a real in-memory renderer like WebKit that is accessible from Ruby, preferably with the ability to set cookies and, preferably, to load raw HTML instead of triggering real HTTP requests.