Navigating / scraping hashbang links with javascript (phantomjs)

if (phantom.state.length === 0) {
    if (phantom.args.length === 0) {
        console.log('Usage: loadreg_1.js <some hash>');
        phantom.exit();
    }
    var address = 'http://www.regulations.gov/';
    console.log(address);
    phantom.state = Date.now().toString();
    phantom.open(address);
} else {
    var hash = phantom.args[0];
    document.location = hash;
    console.log(document.location.hash);
    var elapsed = Date.now() - new Date().setTime(phantom.state);
    if (phantom.loadStatus === 'success') {
        if (!first_time) {
            var first_time = true;
            if (!document.addEventListener) {
                console.log('Not SUPPORTED!');
            }
            phantom.render('result.png');
            var markup = document.documentElement.innerHTML;
            console.log(markup);
            phantom.exit();
        }
    } else {
        console.log('FAIL to load the address');
        phantom.exit();
    }
}

The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.

In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
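As a rough sketch of that polling idea, here is a small generic helper. The name waitFor, its parameters, and the timeout behaviour are my own assumptions, not anything built into PhantomJS; the timeout is there so the script does not spin forever if the content never shows up.

```javascript
// Hypothetical helper: waitFor is NOT a PhantomJS built-in.
//   check      function that returns true once the content is present
//   onReady    called once, the first time check() returns true
//   onTimeout  called once if maxWaitMs elapses before check() succeeds
//   intervalMs how often to call check()
//   maxWaitMs  how long to keep trying before giving up
function waitFor(check, onReady, onTimeout, intervalMs, maxWaitMs) {
    var start = Date.now();
    var intervalId = setInterval(function () {
        if (check()) {
            // content is there: stop polling before handing off
            clearInterval(intervalId);
            onReady();
        } else if (Date.now() - start >= maxWaitMs) {
            // give up so the script does not run forever
            clearInterval(intervalId);
            onTimeout();
        }
    }, intervalMs);
    return intervalId;
}

// Possible use inside the PhantomJS script (assumes the page has loaded):
// waitFor(
//     function () { return document.querySelector('h1') !== null; },
//     function () { phantom.render('result.png'); phantom.exit(); },
//     function () { console.log('Timed out'); phantom.exit(); },
//     300, 10000
// );
```

The 300 ms interval and 10 second limit in the commented usage are arbitrary; tune them to how slowly the site responds.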

The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to hard-code the H1 text you expect to see on each page, keyed by the hash:

var titleMap = {
    '#!contactUs': 'Contact Us',
    '#!aboutUs': 'About Us'
    // etc for the other pages
};

Then in your success block, you can set an interval that keeps checking the h1 tags for the title you want. When it shows up, you know you can render the page:

if (phantom.loadStatus === 'success') {
    // poll every 300 milliseconds until the expected title shows up
    var intervalId = window.setInterval(function () {
        // querySelectorAll always returns a NodeList (possibly empty),
        // so test its length rather than its truthiness
        var h1s = document.querySelectorAll('h1');
        if (h1s.length === 0) {
            console.log('No H1 tags found.');
            return;
        }
        // h1s is a node list, not an array, hence the
        // weird syntax here
        var found = Array.prototype.some.call(h1s, function (h1) {
            return h1.textContent.trim() === titleMap[hash];
        });
        if (!found) {
            console.log('Found H1 tags, but not ' + titleMap[hash]);
            return;
        }
        // we found it! stop the cycle, then render
        window.clearInterval(intervalId);
        console.log('Found H1: ' + titleMap[hash]);
        phantom.render('result.png');
        console.log('Rendered image.');
        phantom.exit();
    }, 300);
}

The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
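For pages like search results, where you can't know the title up front, the check could be based on a marker element or a bit of text instead. A minimal sketch, assuming you have inspected the page and picked a selector or phrase yourself; the function names, the '.search-result' selector, and the 'results' phrase below are all made up for illustration:

```javascript
// Hypothetical helpers; neither is part of PhantomJS or the DOM API.

// True once at least one element matching `selector` exists in `doc`.
function hasElement(doc, selector) {
    return doc.querySelector(selector) !== null;
}

// True once the body text of `doc` contains `snippet`.
function hasText(doc, snippet) {
    var body = doc.body;
    return !!body && body.textContent.indexOf(snippet) !== -1;
}

// Possible use inside the polling callback (selector and phrase are
// placeholders, not taken from the real site):
// if (hasElement(document, '.search-result') || hasText(document, 'results')) {
//     phantom.render('result.png');
//     phantom.exit();
// }
```

Passing the document in as an argument is only there so the checks can be tried against a stub object outside the browser.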

Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
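I haven't tested this against the site, but a listener-based version might look roughly like the sketch below. It assumes the newer WebPage API (a page object from require('webpage').create(), with onResourceRequested and onResourceReceived callbacks) rather than the phantom.open style used above. The helper name watchResources and the "quiet period" idea are my own assumptions, and the sketch ignores requests that fail without ever reporting an 'end' stage.

```javascript
// Hypothetical helper, not part of PhantomJS.
// `page` is assumed to be a PhantomJS WebPage object, e.g.
//     var page = require('webpage').create();
// onSettled is called once, after no requests have been in flight
// for quietMs milliseconds.
function watchResources(page, onSettled, quietMs) {
    var pending = 0;   // requests sent but not yet finished
    var timer = null;  // quiet-period timer
    var done = false;  // make sure onSettled fires only once

    page.onResourceRequested = function () {
        pending += 1;
        // new activity: cancel any quiet-period countdown
        clearTimeout(timer);
    };

    page.onResourceReceived = function (response) {
        // each response is reported in stages; 'end' marks completion
        if (response.stage !== 'end') {
            return;
        }
        pending -= 1;
        clearTimeout(timer);
        if (pending === 0 && !done) {
            timer = setTimeout(function () {
                done = true;
                onSettled();
            }, quietMs);
        }
    };

    return {
        pendingCount: function () { return pending; }
    };
}

// Possible use (untested against the real site):
// var page = require('webpage').create();
// watchResources(page, function () {
//     page.render('result.png');
//     phantom.exit();
// }, 500);
// page.open('http://www.regulations.gov/#!contactUs');
```

Even with this in place, network quiet does not guarantee the page has finished rendering, so checking for an expected element is still the safer signal.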
