Problems with web site scraping using zombie.js

I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems to me that it doesn't even support basic event-based Javascript code like in the following web page:

<html>
  <head>
    <title>test</title>
    <script type="text/javascript">

      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);

      document.addEventListener('DOMContentLoaded', function(){
        console.log("DOMContentLoaded triggered");
      }, false);

      function loaded() {
        console.log("onload triggered");
      }

    </script>
  </head>

  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>

I then decided to trigger those events manually like this:

zombie = require("zombie");

zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {

  doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {

    body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {

      console.log(browser.html());

    });
  });

});

This works for this particular test page. My problem is a more general one, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). It is no problem to log into the site using zombie, but some content like those lists seems to be loaded entirely dynamically via AJAX, and I don't know how to trigger the event handlers that initiate the loading.
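
To make this more concrete, here is a rough sketch of what I imagine the code would have to look like (the selector is made up, and I am only assuming that zombie's wait() lets pending timers and XHR callbacks finish before its callback fires):

zombie = require("zombie");

zombie.visit("http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends", { debug: true }, function (err, browser, status) {

  // hypothetical selector -- whatever element's click handler starts the AJAX load
  var link = browser.querySelector("a.friends-list-link");

  // fire a click event on it, just like DOMContentLoaded/load above
  browser.fire("click", link, function (err, browser, status) {

    // assumption: wait() lets queued timers and XHR callbacks run before we read the DOM
    browser.wait(function (err, browser) {
      console.log(browser.html());
    });
  });
});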

There are several questions I have regarding this problem:

  • Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
  • Is there some reference on the loading process of a complex Javascript-based page?
  • Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
  • Any other ideas about this topic?

Again, please do not point me to solutions involving controlling a real browser like Selenium, as I know about those. What is welcome, however, are suggestions for a real in-memory renderer like WebKit accessible from the Ruby scripting language, preferably with the possibility to set cookies and preferably also to load raw HTML instead of triggering real HTTP requests.

Quaternion answered 7/9, 2011 at 15:50 Comment(2)
Are you looking for a Javascript test framework, or a web data-extraction tool? If you're just looking for a screen-scraping tool, it's possible to scrape most sites without executing their Javascript, even AJAX-heavy ones. Doukhobor
The question is about web scraping. You are right, it is often indeed possible to do this without executing JS, e.g. by issuing REST requests manually. In the case of Facebook, scraping the mobile version of the site is quite possible using only HTTP and HTML parsing. But I am interested in a generic solution that understands Javascript and does not require a real browser instance. This seems to be possible, as env.js and zombie.js show, but it seems to be a tricky problem. Quaternion

For purposes of data extraction, running a "headless browser" and triggering Javascript events manually is not going to be the easiest approach. While not impossible, there are simpler ways to do it.

Most sites, even AJAX-heavy ones, can be scraped without executing a single line of their Javascript code. In fact it's usually easier than trying to figure out a site's Javascript code, which is often obfuscated, minified, and difficult to debug. If you have a solid understanding of HTTP you will understand why: (almost) all interactions with the server are encoded as HTTP requests, so whether they are initiated by Javascript, or the user clicking a link, or custom code in a bot program, there's no difference to the server. (I say almost because when Flash or applets get involved there's no telling what data is flying where; they can be application-specific. But anything done in Javascript will go over HTTP.)

That being said, it is possible to mimic a user on any website using custom software. First you have to be able to see the raw HTTP requests being sent to the server. You can use a proxy server to record requests made by a real browser to the target website. There are many, many tools you can use for this: Charles or Fiddler are handy, most dedicated screen-scraper tools have a basic proxy built in, and the Firebug extension for Firefox and Chrome's developer tools let you view AJAX requests... you get the idea.

Once you can see the HTTP requests that are made as a result of a particular action on the website, it is easy to write a program to mimic these requests; just send the same requests to the server and it will treat your program just like a browser in which a particular action has been performed.
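
To give a rough idea of what that looks like in Node (since you're already using it for zombie), here is a minimal sketch of replaying one recorded request; the host, path, and cookie value are placeholders you would copy out of your proxy log:

var https = require("https");

// All values below are placeholders -- copy the real host, path, and headers
// from the request you recorded in your proxy.
var options = {
  hostname: "www.example.com",
  path: "/ajax/friends?id=100000028174850",
  method: "GET",
  headers: {
    // reuse the session cookie your logged-in browser sent
    "Cookie": "session=PASTE_FROM_PROXY_LOG",
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest"
  }
};

var req = https.request(options, function (res) {
  res.setEncoding("utf8");
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () {
    // whatever the AJAX endpoint returns (an HTML fragment or JSON) ends up here
    console.log(body);
  });
});

req.on("error", function (err) { console.error(err); });
req.end();

The response will typically be an HTML fragment or JSON, which is far easier to parse than it is to reproduce the page's own Javascript.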

There are different libraries for different languages, offering different capabilities. For Ruby, I have seen a lot of people using Mechanize.

If data extraction is your only goal, then you'll almost always be able to get what you need by mimicking HTTP requests this way. No Javascript required.

Note - Since you mentioned Facebook, I should mention that scraping Facebook specifically can be exceptionally difficult (although not impossible), because Facebook has measures in place to detect automated access (they use more than just captchas); they will disable an account if they see suspicious activity coming from it. It is, after all, against their terms of service (section 3.2).

Doukhobor answered 7/9, 2011 at 18:0 Comment(5)
Thank you for formulating this sophisticated answer to the question. I already use Firebug and Fiddler2 for monitoring HTTP traffic to and from web servers, which is however not very useful if a hard-to-reverse communication scheme is used, as done by many social networking sites. But even if it is possible to use the low-level interface to talk to a web server and extract information, this will require constant tweaking of the scraper, which can be very time-consuming. Env.js (which I almost got to work as I want) shows that it is in fact possible to simulate a real browser programmatically. Quaternion
It's true, Facebook and other sites try to make it as hard as possible for you to scrape their sites; they prefer you to use their APIs so they can better control what your program accesses, and therefore better protect their users' privacy. Doukhobor
Still, emulating a surfing user with a tool like Selenium seems to work without problems (except for slowness); I didn't encounter any obstacles apart from the extensive use of dynamic content. Even OAuth is not secured at all against automated access; authentication can be scripted without any problems and does not even require Javascript to work. Quaternion
Websites now check that the order and timing of HTTP requests are (statistically) plausible. Openmouthed
Yes, understanding the JS is one way, but often the website needs cookie values and you have to reproduce many parts of the process. I think it is easier to click the button with a headless browser than to do the reverse engineering. Bombardier
