How to list loaded resources with Selenium/PhantomJS?
I want to load a webpage and list all loaded resources (javascript/images/css) for that page. I use this code to load the page:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://example.com')

The code above works perfectly and I can do some processing to the HTML page. The question is, how do I list all of the resources loaded by that page? I want something like this:

['http://example.com/img/logo.png',
 'http://example.com/css/style.css',
 'http://example.com/js/jquery.js',
 'http://www.google-analytics.com/ga.js']

I'm also open to other solutions, such as using the PySide QWebView module. I just want to list the resources loaded by the page.

Chitter answered 5/11, 2013 at 10:16 Comment(1)
This is exactly what I need to accomplish also. Ghost.py had/has a super direct way of doing that, except Ghost.py doesn't seem to work very well.Eckmann
This is not a Selenium solution, but it works really well with Python and PhantomJS.

The idea is to do exactly what the 'Network' tab in Chrome Developer Tools does: listen to every request made by the webpage.

JavaScript/PhantomJS part

Using PhantomJS, this can be done with the following script; adapt it to your needs:

// getResources.js
// Usage: 
// ./phantomjs --ssl-protocol=any --web-security=false getResources.js your_url
// the ssl-protocol and web-security flags are added to dismiss SSL errors

var page = require('webpage').create();
var system = require('system');
var urls = [];

// function to check if the requested resource is an image
function isImg(url) {
  var acceptedExts = ['jpg', 'jpeg', 'png'];
  var baseUrl = url.split('?')[0];
  var ext = baseUrl.split('.').pop().toLowerCase();
  return acceptedExts.indexOf(ext) > -1;
}

// function to check if a URL has a given extension
function isExt(url, ext) {
  var baseUrl = url.split('?')[0];
  var fileExt = baseUrl.split('.').pop().toLowerCase();
  return ext === fileExt;
}

// Listen for all requests made by the webpage
// (like the 'Network' tab of Chrome Developer Tools)
// and add them to an array
page.onResourceRequested = function(request, networkRequest) {
  // If the requested URL is that of the webpage itself, do nothing,
  // to allow the other resource requests
  if (system.args[1] == request.url) {
    return;
  } else if (isImg(request.url) || isExt(request.url, 'js') || isExt(request.url, 'css')) {
    // The URL is an image, CSS or JS file:
    // add it to the array
    urls.push(request.url);
    // abort the request for a better response time;
    // can be omitted to also collect asynchronously loaded files
    networkRequest.abort();
  }
};
};

// When all requests are made, output the array to the console
page.onLoadFinished = function(status) {
  console.log(JSON.stringify(urls));
  phantom.exit();
};

// If an error occurs, dismiss it
page.onResourceError = function() {
  return false;
};
page.onError = function() {
  return false;
};

// Open the web page
page.open(system.args[1]);

Python part

Now call the script from Python:

from subprocess import check_output
import json

out = check_output(['./phantomjs', '--ssl-protocol=any', \
    '--web-security=false', 'getResources.js', your_url])
data = json.loads(out)
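The resulting `data` is a plain Python list of URL strings. As a quick post-processing sketch (the `group_by_ext` helper is my own, not part of the script above), you could bucket the collected URLs by file extension:

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_ext(urls):
    """Bucket resource URLs by file extension, ignoring query strings."""
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        # mirror the JS side: drop the query string, take the last '.' segment
        ext = path.rsplit('.', 1)[-1].lower() if '.' in path else ''
        groups[ext].append(url)
    return dict(groups)

urls = ['http://example.com/img/logo.png',
        'http://example.com/js/jquery.js?v=2',
        'http://example.com/css/style.css']
print(group_by_ext(urls))
```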

Hope this helps.

Hoptoad answered 27/11, 2014 at 0:55 Comment(2)
Indeed but the Selenium webdriver api doesn't give us full access of phantomjs api, I edited my answer to show how you can use this script using python.Hoptoad
How to do this with Java?Nystatin

Here is a pure-Python solution using Selenium and ChromeDriver.

How it works:

  1. First we create a minimalistic HTTP proxy listening on localhost. This proxy is the one responsible for printing whatever requests Selenium generates.
    (NOTE: we are using multiprocessing to avoid splitting the script in two, but you could just as well have the proxy part in a separate script)
  2. Then we create the webdriver, configured with the proxy from step 1, read URLs from standard input and load them serially.
    Loading the URLs in parallel is left as an exercise for the reader ;)

To use this script, you just type URLs on standard input, and it prints the loaded URLs (with their respective referrers) on standard output. The code:

#!/usr/bin/python3

import sys
import time
import socketserver
import http.server
import urllib.request
from multiprocessing import Process

from selenium import webdriver

PROXY_PORT = 8889
PROXY_URL = 'localhost:%d' % PROXY_PORT

class Proxy(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        sys.stdout.write('%s → %s\n' % (self.headers.get('Referer', 'NO_REFERER'), self.path))
        self.copyfile(urllib.request.urlopen(self.path), self.wfile)
        sys.stdout.flush()

    @classmethod
    def target(cls):
        httpd = socketserver.ThreadingTCPServer(('', PROXY_PORT), cls)
        httpd.serve_forever()

p_proxy = Process(target=Proxy.target)
p_proxy.start()


webdriver.DesiredCapabilities.CHROME['proxy'] = {
    "httpProxy":PROXY_URL,
    "ftpProxy":None,
    "sslProxy":None,
    "noProxy":None,
    "proxyType":"MANUAL",
    "class":"org.openqa.selenium.Proxy",
    "autodetect":False
}

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')
for url in sys.stdin:
    driver.get(url)
driver.close()
del driver
p_proxy.terminate()
p_proxy.join()
# avoid warnings about selenium.Service not shutting down in time
time.sleep(3)
Readjustment answered 24/4, 2018 at 13:26 Comment(0)

There isn't a function in WebDriver that returns all the resources a web page has loaded, but what you could do is something like this:

from selenium.webdriver.common.by import By
images = driver.find_elements(By.TAG_NAME, "img")

and the same for script and link tags.
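Extending that idea, a small helper (the name and structure are my own sketch, not a WebDriver API) can gather the src/href attributes of all three tag types into one list, roughly the shape the question asks for. The locator string 'tag name' is what By.TAG_NAME expands to, which keeps the function free of extra imports:

```python
def collect_resources(driver):
    """Collect URLs of images, scripts and stylesheets in the current page.

    `driver` is any Selenium WebDriver with the page already loaded.
    """
    # (tag, attribute) pairs covering the resource types in the question
    wanted = (('img', 'src'), ('script', 'src'), ('link', 'href'))
    resources = []
    for tag, attr in wanted:
        # 'tag name' is the locator string behind By.TAG_NAME
        for el in driver.find_elements('tag name', tag):
            url = el.get_attribute(attr)
            if url:  # skip inline scripts/styles with no URL
                resources.append(url)
    return resources
```

Keep in mind this only sees resources referenced in the DOM after load; requests made and discarded by scripts (e.g. analytics beacons) won't appear, which is why the request-interception approaches in the other answers are more complete.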

Holeproof answered 5/11, 2013 at 11:21 Comment(0)