headless internet browser? [closed]
D

14

71

I would like to do the following: log into a website, click a couple of specific links, then click a download link. I'd like to run this as either a scheduled task on Windows or a cron job on Linux. I'm not picky about the language I use, but I'd like this to run without putting a browser window up on the screen if possible.

Demimonde answered 2/5, 2009 at 12:13 Comment(2)
Why instantiate a browser if you are not going to display it? There are libraries in most languages for transferring files through URLs. Tell us your implementation language and we might point you in the right direction.Flash
Also tell us if you are going to need JavaScript support, because this is important. Some libraries do not have built in JS interpreters.Skell
V
161

Here is a list of headless browsers that I know about:

  • HtmlUnit - Java. Custom browser engine. Limited JavaScript support/DOM emulated. Open source.
  • Ghost - Python only. WebKit-based. Full JavaScript support. Open source.
  • Twill - Python/command line. Custom browser engine. No JavaScript. Open source.
  • PhantomJS - Command line/all platforms. WebKit-based. Full JavaScript support. Open source.
  • Awesomium - C++/.NET/all platforms. Chromium-based. Full JavaScript support. Commercial/free.
  • SimpleBrowser - .NET 4/C#. Custom browser engine. No JavaScript support. Open source.
  • ZombieJS - Node.js. Custom browser engine. JavaScript support/emulated DOM. Open source. Based on jsdom.
  • EnvJS - JavaScript via Java/Rhino. Custom browser engine. JavaScript support/emulated DOM. Open source.
  • Watir-webdriver with headless gem - Ruby via WebDriver. Full JavaScript support via browsers (Firefox/Chrome/Safari/IE).
  • Spynner - Python only. PyQt and WebKit.
  • jsdom - Node.js. Custom browser engine. Supports JS via emulated DOM. Open source.
  • TrifleJS - port of PhantomJS using MSIE (Trident) and V8. Open source.
  • ui4j - Pure Java 8 solution. A wrapper library around the JavaFX WebKit engine, including headless modes.
  • Chromium Embedded Framework - Full up-to-date embedded version of Chromium with off-screen rendering as needed. C/C++, with .NET wrappers (and other languages). As it is Chromium, it has support for everything. BSD licensed.
  • Selenium WebDriver - Full support for JavaScript via browsers (Firefox, IE, Chrome, Safari, Opera). Officially supported bindings are C#, Java, JavaScript, Python, and Ruby; third-party bindings exist for other languages, including Haskell, Perl, PHP, Objective-C, R, Qt, and Go. Open source.

Headless browsers that support JavaScript via an emulated DOM generally have trouble with sites that use more advanced or obscure browser features, or whose functionality depends on visual layout (e.g. CSS positioning). So while the pure JavaScript support in these browsers is generally complete, the browser functionality they actually support should be considered partial.

(Note: Original version of this post only mentioned HtmlUnit, hence the comments. If you know of other headless browser implementations and have edit rights, feel free to edit this post and add them.)
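Chrome and Chromium can also run headless directly from the command line (a comment below this answer reports version 60 working). As a rough sketch only, the Python snippet below builds such a command and captures the rendered DOM. The binary name `google-chrome` and the helper names are assumptions for illustration; the executable is named differently on some systems.

```python
import shutil
import subprocess


def headless_dump_command(url, binary="google-chrome"):
    """Build the command line that prints a page's DOM after scripts have run."""
    # --headless: no visible window; --dump-dom: write the serialized DOM to stdout
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]


def dump_dom(url, binary="google-chrome", timeout=60):
    """Run headless Chrome and return the rendered HTML as text."""
    if shutil.which(binary) is None:
        raise FileNotFoundError("browser binary not found on PATH: " + binary)
    result = subprocess.run(
        headless_dump_command(url, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the browser exits with an error
    )
    return result.stdout
```

Calling `dump_dom("http://example.com/")` would return the page source after JavaScript has executed. Logging in and clicking links needs a driver such as Selenium WebDriver on top of this.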

Vassalage answered 2/5, 2009 at 14:15 Comment(9)
+1, HTMLUnit's JS support is a big plusFootpace
This seems like the best bet I've found so far in my search for a headless browser w/ JS support.Charissa
JS support for HtmlUnit is terrible. It's not the answer, I'm afraid.Vibrant
Nothing but problems with HtmlUnit's javascript. Consider it a JS-less browser.Cortie
HtmlUnit and HttpUnit are both unfortunately pre-Ajax. They were written for an era when Javascript was used for little more than form-validation (you can completely forget about something like JQuery EVER working under either one), and from what I've read, neither one is likely to ever support "modern" Javascript just because it would require either a complete rewrite of their Javascript engine, or its replacement by another one whose bindings would likely be so different from the original one, it would require a de-facto rewrite of the whole framework to accommodate it.Element
@Element Interesting. You'll note of course that the thread is from 2009, but good to know nonetheless. I've made a small edit to the list to indicate "limited" JavaScript.Vassalage
trifleJs (triflejs.org) uses Trident (Internet Explorer rendering engine) and an API based on PhantomJS.Paroxysm
@LuisCantero thanks, I've added it to the list.Vassalage
I've recently used google chrome in headless mode described at developers.google.com/web/updates/2017/04/headless-chrome Was easy to use with chrome version 60Amaze
F
5

Check out twill, a very convenient scripting language for precisely what you're looking for. From the examples:

setlocal username <your username>
setlocal password <your password>

go http://www.slashdot.org/
formvalue 1 unickname $username
formvalue 1 upasswd $password
submit

code 200     # make sure form submission is correct!

There's also a Python API if you're looking for more flexibility.

Footpace answered 11/5, 2009 at 9:8 Comment(0)
I
4

Have a look at PhantomJS, a JavaScript-based automation framework available for Windows, Mac OS X, Linux, and other *nix systems.

Using PhantomJS, you can do things like this:

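console.log('Loading a web page');

// create the page through the webpage module, as in the second example
var page = require('webpage').create();
var url = "http://www.phantomjs.org/";

page.open(url, function (status) {
    // status is "success" or "fail"
    // perform your task once the page is ready ...
    phantom.exit();
});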

Or evaluate a page's title:

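var page = require('webpage').create();
var url = "http://www.phantomjs.org/";
page.open(url, function (status) {
    // the function passed to page.evaluate runs inside the loaded page
    var title = page.evaluate(function () {
        return document.title;
    });
    console.log('Page title is ' + title);
    phantom.exit(); // without this call the script never terminates
});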

Examples from PhantomJS' Quickstart page. You can even render a page to a PNG, JPEG or PDF using the render() method.

Isolationism answered 19/4, 2012 at 22:42 Comment(2)
this answer helped me save the source after javascript ran.: https://mcmap.net/q/276337/-save-html-output-of-page-after-execution-of-the-page-39-s-javascriptAstrogate
A rather dumb question, but maybe you tested it: Is PhantomJS expected to work on sites that require username/password?Devotee
W
2

I once did that using the Internet Explorer ActiveX control (WebBrowser, MSHTML). You can instantiate it without making it visible.

This can be done with any language that supports COM (Delphi, VB6, VB.NET, C#, C++, ...).

Of course this is a quick-and-dirty solution and might not be appropriate in your situation.

Woorali answered 2/5, 2009 at 12:18 Comment(0)
C
2

PhantomJS is a headless WebKit-based browser that you can script with JavaScript.

Consumable answered 8/11, 2011 at 17:21 Comment(0)
H
1

Except for the automatic download of the file (which raises a dialog box), a Windows Forms application with an embedded web browser control will do this.

You could look at WatiN and WatiN Recorder. They may help with C# code that can log in to your website, navigate to a URL and possibly even help automate the file download.

YMMV though.

Hinge answered 2/5, 2009 at 12:19 Comment(0)
S
1

If the links are known (i.e., you don't have to search the page for them), then you can probably use wget. It can carry state across multiple fetches by saving and reloading cookies (--save-cookies and --load-cookies, plus --keep-session-cookies for session cookies).

If you are a little more enterprising, I would delve into the new goodies in Python 3.0. The interface to its HTTP stack was redone and, IMHO, it is now very well suited to this type of scripting.
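A minimal sketch of that idea with the Python 3 standard library, assuming the site uses an ordinary form login with cookies. The form field names (`username`, `password`), the helper names, and any URLs passed in are placeholders, not details of a real site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies across requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, username, password):
    """Build a POST request carrying form-encoded credentials.

    The field names are assumptions; use the names from the real login form.
    """
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=body.encode("ascii"))


def fetch_after_login(login_url, username, password, file_url, filename):
    """Log in, then download file_url using the cookies set by the login."""
    opener, _jar = make_session()
    opener.open(login_request(login_url, username, password)).close()
    with opener.open(file_url) as response, open(filename, "wb") as out:
        out.write(response.read())
```

Because the same opener handles every request, the session cookie returned by the login page is sent automatically with the download request. A script like this can be run unattended from cron or the Windows Task Scheduler.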

Sid answered 2/5, 2009 at 12:27 Comment(0)
M
1

Node.js with YUI on the server. Check out this video: http://www.yuiblog.com/blog/2010/09/29/video-glass-node/

In this video, Dav Glass shows how he uses Node to fetch a page from Digg. He then attaches YUI to the DOM he grabbed and can manipulate it completely.

Matos answered 18/3, 2011 at 13:57 Comment(0)
M
1

If you use PHP - try http://mink.behat.org/

Marcelmarcela answered 23/10, 2011 at 17:15 Comment(0)
A
0

You can use Watir with Ruby or WatiN with Mono.

Atlante answered 2/5, 2009 at 12:28 Comment(0)
E
0

You can also use Live HTTP Headers (a Firefox extension) to record the headers that are sent to the site (Login -> Links -> Download Link) and then replicate them with PHP using fsockopen. The only thing you'll probably need to vary is the value of the cookie that you receive from the login page.
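The same replay idea can be sketched in Python with `http.client` in place of PHP's fsockopen. The recorded request text, the helper names, and the cookie value are made up for illustration; a real recording would come from the extension's log.

```python
import http.client

# Example of a recorded request; the contents are hypothetical.
RECORDED = """GET /files/report.zip HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0
Cookie: session=old
"""


def parse_recorded_request(text):
    """Split a recorded request into (method, path, headers)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, headers


def replay(text, cookie, body=None):
    """Resend a recorded request with a fresh Cookie value."""
    method, path, headers = parse_recorded_request(text)
    headers["Cookie"] = cookie  # the one value that changes per login
    # Let http.client compute these itself
    headers.pop("Content-Length", None)
    headers.pop("Accept-Encoding", None)
    conn = http.client.HTTPConnection(headers["Host"], timeout=30)
    try:
        conn.request(method, path, body=body, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

`replay(RECORDED, "session=new-value")` would send the recorded request again with the cookie obtained from a fresh login and return the status code and response body.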

Elflock answered 2/5, 2009 at 12:29 Comment(0)
R
0

libCURL could be used to create something like this.

Robrobaina answered 2/5, 2009 at 13:15 Comment(0)
C
0

Can you not just use a download manager?

There are better ones, but FlashGet has browser integration and supports authentication. You can log in, click a bunch of links, queue them up and schedule the download.

You could write something that, say, acts as a proxy which catches specific links and queues them for later download, or a JavaScript bookmarklet that modifies links to go to "http://localhost:1234/download_queuer?url=" + $link.href and have that queue the downloads. But you'd be reinventing the download-manager wheel, and with authentication it can get more complicated.

Or, if you want the "login, click links" bit to be automated as well, look into screen scraping. Basically you load the page via an HTTP library, find the download links and download them.

Slightly simplified example, using Python:

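import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

base = "http://example.com/"
# Register the credentials for HTTP basic authentication
passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passwords.add_password(None, base, "username", "password")
auth = urllib.request.HTTPBasicAuthHandler(passwords)
urllib.request.install_opener(urllib.request.build_opener(auth))

soup = BeautifulSoup(urllib.request.urlopen(base), "html.parser")

for link_tag in soup.find_all("a", href=True):
    link = urljoin(base, link_tag["href"])  # resolve relative links
    filename = link.split("/")[-1]  # get everything after last /
    if filename:  # skip links that end in "/"
        urllib.request.urlretrieve(link, filename)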

That would download every link on example.com, after authenticating with the username/password of "username" and "password". You could, of course, find more specific links using BeautifulSoup's HTML selectors (for example, you could find all links with the class "download", or URLs that start with http://cdn.example.com).
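As an illustration of that kind of filtering, here is a sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. The class name `download` and the CDN prefix come from the example above; the class and function names are invented for this sketch.

```python
from html.parser import HTMLParser


class DownloadLinkFinder(HTMLParser):
    """Collect href values of <a> tags that look like download links."""

    def __init__(self, css_class="download", url_prefix="http://cdn.example.com"):
        super().__init__()
        self.css_class = css_class
        self.url_prefix = url_prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return  # anchors without a target cannot be downloaded
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes or href.startswith(self.url_prefix):
            self.links.append(href)


def find_download_links(html):
    """Return the matching hrefs in document order."""
    finder = DownloadLinkFinder()
    finder.feed(html)
    return finder.links
```

Feeding the page source to `find_download_links` returns only the links worth fetching, which can then be passed to the download loop instead of every anchor on the page.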

You could do the same in pretty much any language.

Conformity answered 2/5, 2009 at 13:26 Comment(0)
W
0

.NET contains System.Windows.Forms.WebBrowser. You can create an instance of this, send it to a URL, and then easily parse the html on that page. You could then follow any links you found, etc.

I have worked with this object only minimally, so I'm no expert, but if you're already familiar with .NET then it would probably be worth looking into.

Woodprint answered 2/5, 2009 at 14:9 Comment(0)
