The new way, using promises and async/await (ES2017)
Usually when you're scraping, you want some method to:
- Fetch the resource from the web server (usually an HTML document)
- Read that resource and work with it as either:
  - a DOM/tree structure that you can navigate, or
  - a stream of tokens, parsed with something like SAX (sketched below).
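For contrast, here is a minimal sketch of the token/streaming approach, assuming the htmlparser2 package (not part of the example below; any SAX-style parser works the same way). You react to tokens as they stream past instead of navigating a finished tree:

const htmlparser2 = require('htmlparser2');

// The parser fires a callback per token instead of building a tree
const parser = new htmlparser2.Parser({
    onopentag(name, attribs) {
        if (name === 'a') console.log('link ->', attribs.href);
    },
    ontext(text) {
        // raw text tokens, including the whitespace between tags
    },
});

parser.write('<p>Hello <a href="http://example.com">world</a></p>');
parser.end();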
Both tree and token parsing have advantages, but tree parsing is usually substantially simpler, so that's what we'll do. Check out request-promise; here is how it works:
const rp = require('request-promise');
const cheerio = require('cheerio'); // Basically jQuery for node.js

const options = {
    uri: 'http://www.google.com',
    transform: function (body) {
        // Whatever transform returns is what the promise resolves with
        return cheerio.load(body);
    }
};

rp(options)
    .then(function ($) {
        // Process html like you would with jQuery...
    })
    .catch(function (err) {
        // Crawling failed or Cheerio choked on the body...
    });
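To make that .then handler concrete, here is one way you might fill it in — a sketch that collects the href of every anchor on the page (the selector and logging are just examples):

rp(options)
    .then(function ($) {
        // $ is the cheerio-loaded document returned by transform
        const links = $('a')
            .map(function () { return $(this).attr('href'); })
            .get(); // .get() converts the cheerio collection to a plain array
        console.log(links);
    })
    .catch(function (err) {
        console.error('Crawl failed:', err.message);
    });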
This is all using cheerio, which is essentially a lightweight server-side jQuery-esque library (one that doesn't need a window object or jsdom).
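If you haven't used it before: cheerio parses a markup string and hands you back a familiar $. A quick standalone sketch:

const cheerio = require('cheerio');

const $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!'); // setters work too, just like jQuery
$('h2').addClass('welcome');

console.log($.html()); // serialized markup, with the updated text and class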
Because you're working with promises, you can also write this as an asynchronous function. It will look synchronous, but it will run asynchronously, thanks to async/await (ES2017):
async function parseDocument() {
    let $;
    try {
        $ = await rp(options);
    } catch (err) {
        console.error(err);
        return; // bail out: $ is undefined if the request failed
    }
    console.log($('title').text()); // prints just the text inside <title>
}
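One caveat: await is only valid inside an async function, and top-level await didn't exist in ES2017, so at the top level you still invoke it as an ordinary promise-returning call:

parseDocument()
    .then(function () { console.log('done'); })
    .catch(function (err) { console.error('Unexpected failure:', err); });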