Scrape web pages in real time with Node.js

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results scraped, and returned to the client as they become available.

Let's assume that this script should just provide the results in JSON format, and we can process them either directly in the browser or in another web application.

A few starting points:

Using node.js and jquery to scrape websites

Anybody have any ideas?

Agger answered 6/3, 2011 at 15:47 Comment(3)
I feel like your second link answers your own question. – Zinck
As the author of node.io I can vouch for this ;) – Nuts
Does this answer your question? How can I scrape pages with dynamic content using node.js? – Leff

Node.io seems to take the cake :-)

Agger answered 12/3, 2011 at 15:24 Comment(2)
As the author I can vouch for node.io ;) – Nuts
Node.io is no longer maintained. It's dead, and the original domain name was sold. This answer isn't current. – Chapnick

All the aforementioned solutions assume running the scraper locally. This means you will be severely limited in performance (because the requests run in sequence or in a limited set of threads). A better approach, IMHO, is to rely on an existing, albeit commercial, scraping grid.

Here is an example:

var bobik = new Bobik("YOUR_AUTH_TOKEN");
bobik.scrape({
  urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
  queries:  ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
}, function (scraped_data) {
  if (!scraped_data) {
    console.log("Data is unavailable");
    return;
  }
  var scraped_urls = Object.keys(scraped_data);
  // Note: for...in over an array yields indices, not values, so walk the keys explicitly
  for (var i = 0; i < scraped_urls.length; i++) {
    var url = scraped_urls[i];
    console.log("Results from " + url + ": " + scraped_data[url]);
  }
});

Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).

You can download the Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk

Neurology answered 14/7, 2012 at 15:44 Comment(0)

I've been doing some research myself, and https://npmjs.org/package/wscraper bills itself as:

a web scraper agent based on cheerio.js, a fast, flexible, and lean implementation of core jQuery; built on top of request.js; inspired by http-agent.js

It has very low usage (according to npmjs.org), but it's worth a look for any interested parties.

Pearlstein answered 3/6, 2013 at 23:49 Comment(0)

You don't always need jQuery. If you play with the DOM returned from jsdom, for example, you can easily take what you need yourself (and you don't have to worry about cross-browser issues). See https://gist.github.com/1335009. That's not taking away from node.io at all, just saying you might be able to do it yourself, depending on your needs...
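
For example, here is a minimal sketch of that approach using the modern jsdom API (which postdates the linked gist); the URL is just a placeholder:

const { JSDOM } = require('jsdom');

// Fetch a page and walk its DOM with plain DOM methods -- no jQuery needed.
JSDOM.fromURL('https://example.com/').then(function (dom) {
  const document = dom.window.document;
  const title = document.querySelector('title').textContent;
  const links = Array.from(document.querySelectorAll('a'), function (a) { return a.href; });
  console.log(title, links);
});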

Opuntia answered 24/4, 2012 at 21:16 Comment(0)

The new way using ES7/promises

Usually when you're scraping, you want some method to:

  1. Get the resource from the web server (usually an HTML document)
  2. Read that resource and work with it as either
    1. a DOM/tree structure that you can navigate, or
    2. a stream of tokens, using a parser such as SAX.

Both tree and token parsing have advantages, but tree parsing is usually substantially simpler. We'll do that. Check out request-promise; here is how it works:

const rp = require('request-promise');
const cheerio = require('cheerio'); // Basically jQuery for node.js 

const options = {
    uri: 'http://www.google.com',
    transform: function (body) {
        return cheerio.load(body);
    }
};

rp(options)
    .then(function ($) {
        // Process html like you would with jQuery...
    })
    .catch(function (err) {
        // Crawling failed or Cheerio failed to parse the HTML
    });

This uses cheerio, which is essentially a lightweight server-side jQuery-esque library (one that doesn't need a window object or jsdom).

Because you're using promises, you can also write this in an asynchronous function. It'll look synchronous, but it'll be asynchronous with ES7:

async function parseDocument() {
    let $;
    try {
      $ = await rp(options);
    } catch (err) {
      console.error(err);
      return; // bail out early; $ was never assigned
    }

    console.log( $('title').text() ); // prints just the text in the <title>
}
Chapnick answered 31/5, 2016 at 2:17 Comment(0)

This is my easy-to-use (but badly spelled) general-purpose scraper, https://github.com/harish2704/html-scraper, written for Node.js. It can extract information based on predefined schemas. A schema definition includes a CSS selector and a data-extraction function. It currently uses cheerio for DOM parsing. A sketch of the general idea is below.
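
To illustrate the schema idea, here is a minimal sketch of the same pattern written directly against cheerio (this is not html-scraper's actual API; the schema shape and the applySchema helper are made up for illustration):

const cheerio = require('cheerio');

// Each schema entry pairs a CSS selector with a data-extraction function.
const schema = {
  title: { selector: 'title', extract: function (el) { return el.text(); } },
  links: { selector: 'a', extract: function (el) { return el.attr('href'); } }
};

function applySchema(html, schema) {
  const $ = cheerio.load(html);
  const result = {};
  Object.keys(schema).forEach(function (key) {
    // Apply the extraction function to every node the selector matches
    result[key] = $(schema[key].selector).map(function (i, el) {
      return schema[key].extract($(el));
    }).get();
  });
  return result;
}

console.log(applySchema('<title>Hi</title><a href="/x">x</a>', schema));
// => { title: [ 'Hi' ], links: [ '/x' ] }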

Laughlin answered 19/5, 2014 at 5:25 Comment(0)

Check out https://github.com/rc0x03/node-promise-parser:

Fast: uses libxml C bindings
Lightweight: no dependencies like jQuery, cheerio, or jsdom
Clean: promise-based interface; no more nested callbacks
Flexible: supports both CSS and XPath selectors
Conjugal answered 9/6, 2014 at 18:20 Comment(0)

I see most answers are on the right path with cheerio and so forth; however, once you get to the point where you need to parse and execute JavaScript (à la SPAs and more), I'd check out https://github.com/joelgriffith/navalia (I'm the author). Navalia is built to support scraping in a headless-browser context, and it's pretty quick. Thanks!

Anyways answered 1/7, 2017 at 17:34 Comment(0)

I think there are two different questions here.

  1. "I'd like to build something very, very fast that can execute searches... to several different sites." To do anything fast, especially multiple tasks (since you want to scrape multiple sites), I suggest learning about multithreading in Node.js. This post from DigitalOcean can help: How to use multithreading in NodeJS. See the first sketch after this list.

  2. Second, on scraping with Node.js: it depends on the site you want to scrape. If it is statically/server rendered, you can use Cheerio to parse the HTML into a nicely structured format. If it's a JavaScript-rendered website, then you must use something like Puppeteer, which can simulate actions like a real visitor. You can read this post that highlights the differences between scraping websites in JavaScript using Puppeteer vs. Cheerio. See the second sketch below. Hope it helps!
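
For point 1, here is a minimal sketch of fanning scrape jobs out to worker threads using only Node's built-in worker_threads module (it assumes Node 18+ for the global fetch; the URLs and the naive title regex are placeholders):

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Main thread: spawn one worker per site so the scrapes run in parallel.
  const sites = ['https://example.com/', 'https://example.org/'];
  sites.forEach(function (url) {
    const worker = new Worker(__filename, { workerData: url });
    worker.on('message', function (result) { console.log(result); });
    worker.on('error', function (err) { console.error(url, err); });
  });
} else {
  // Worker thread: fetch one site and report its <title> back to the main thread.
  fetch(workerData)
    .then(function (res) { return res.text(); })
    .then(function (html) {
      const match = html.match(/<title>(.*?)<\/title>/i);
      parentPort.postMessage({ url: workerData, title: match ? match[1] : null });
    });
}

Worth noting: scraping is mostly I/O-bound, so plain concurrent promises (e.g. Promise.all over fetches) are often enough; worker threads pay off mainly when the parsing itself is CPU-heavy.

And for point 2, a minimal Puppeteer sketch (again, the URL is a placeholder):

const puppeteer = require('puppeteer');

(async function () {
  // Launch a headless browser, let the page render (including its JavaScript),
  // then read data out of the live DOM.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  const title = await page.title();
  console.log(title);
  await browser.close();
})();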

One other trick I should mention: look at the external scripts the website loads; sometimes the data you're looking for is available there!

I have never run a multithreaded task in JavaScript before, but it seems very possible!

Conceptualize answered 26/10, 2023 at 8:30 Comment(0)
