Reliably detecting PhantomJS-based spam bots

3

24

Is there any way to consistently detect PhantomJS/CasperJS? I've been dealing with a spate of malicious spambots built with it and have been able to mostly block them based on certain behaviours, but I'm curious if there's a rock-solid way to know if CasperJS is in use, as dealing with constant adaptations gets slightly annoying.

I don't believe in using Captchas. They are a negative user experience and ReCaptcha has never worked to block spam on my MediaWiki installations. As our site has no user registrations (anonymous discussion board), we'd need to have a Captcha entry for every post. We get several thousand legitimate posts a day and a Captcha would see that number nosedive.

Rhody answered 31/12, 2013 at 20:11 Comment(1)
Did you try QuestyCaptcha, where you choose a static set of questions? Unless your site is specifically targeted by spambots, it will be unwinnable for bots and extremely easy for humans. - Pectase
22

I very much share your take on CAPTCHA. I'll list what I have been able to detect so far, for my own detection script, with similar goals. It's only partial, as there are many more headless browsers.

The following exposed window properties are fairly safe to use to detect (or at least assume) these particular headless browsers:

window._phantom (or window.callPhantom) //phantomjs
window.__phantomas //PhantomJS-based web perf metrics + monitoring tool 
window.Buffer //nodejs
window.emit //couchjs
window.spawn  //rhino

The above is gathered from the JSLint documentation and from testing with PhantomJS.

Browser automation drivers (used by BrowserStack or other web capture services for snapshot):

window.webdriver //selenium
window.domAutomation (or window.domAutomationController) //chromium based automation driver

These properties are not always exposed, and I am looking into other, more robust ways to detect such bots, which I'll probably release as a full-blown script when done. But that mainly answers your question.
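The property checks above can be gathered into one helper. This is a minimal sketch, assuming only the property names listed in this answer; the function name `findAutomationSignals` and its `win` parameter are made up for illustration. An empty result proves nothing, since the properties are not always exposed.

```javascript
// Property names are taken from the lists above; the labels are informal.
var KNOWN_GLOBALS = {
  _phantom: 'phantomjs',
  callPhantom: 'phantomjs',
  __phantomas: 'phantomas',
  Buffer: 'nodejs',
  emit: 'couchjs',
  spawn: 'rhino',
  webdriver: 'selenium',
  domAutomation: 'chromium automation',
  domAutomationController: 'chromium automation'
};

// Hypothetical helper: returns a description of every known global
// that is defined on the given window-like object.
function findAutomationSignals(win) {
  var found = [];
  for (var name in KNOWN_GLOBALS) {
    if (Object.prototype.hasOwnProperty.call(KNOWN_GLOBALS, name) &&
        typeof win[name] !== 'undefined') {
      found.push(name + ' (' + KNOWN_GLOBALS[name] + ')');
    }
  }
  return found;
}
```

In a page this would be called as `findAutomationSignals(window)`, and a non-empty array is best treated as one signal among several rather than as proof.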

Here is another fairly sound method to detect JS capable headless browsers more broadly:

if (window.outerWidth === 0 && window.outerHeight === 0) {
  // likely a headless browser
}

This should work well because the properties are 0 by default even if a virtual viewport size is set by headless browsers, and a headless browser can't report the size of a browser window that doesn't exist. In particular, PhantomJS doesn't support outerWidth or outerHeight.

ADDENDUM: There is, however, a Chrome/Blink bug with the outer/inner dimensions. Chromium does not report those dimensions when a page loads in a hidden tab, such as when restored from a previous session. Safari doesn't seem to have that issue.

Update: Turns out iOS Safari 8+ has a bug with outerWidth & outerHeight at 0, and a Sailfish webview can too. So while it's a signal, it can't be used alone without being mindful of these bugs. Hence, warning: Please don't use this raw snippet unless you really know what you are doing.
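One way to respect those caveats is to treat zero outer dimensions as a weak signal and to skip the check while the page is hidden. This is only a sketch: it assumes the Page Visibility API (`document.visibilityState`) is available, the function name and return values are illustrative, and the iOS Safari and Sailfish cases would still need separate handling.

```javascript
// Hypothetical helper: classifies the outer-dimension check.
// Returns 'unknown' when the check cannot be trusted, 'suspicious' when
// both outer dimensions are 0 on a visible page, and 'ok' otherwise.
function outerDimensionSignal(win, doc) {
  // Hidden tabs in Chromium may not report dimensions, so do not judge them.
  if (doc && doc.visibilityState === 'hidden') {
    return 'unknown';
  }
  if (win.outerWidth === 0 && win.outerHeight === 0) {
    return 'suspicious';
  }
  return 'ok';
}
```

In a page this would be called as `outerDimensionSignal(window, document)`, with 'suspicious' feeding into a wider heuristic rather than blocking on its own.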

PS: If you know of other headless browser properties not listed here, please share in comments.

Vandalize answered 28/6, 2014 at 21:41 Comment(8)
In the case I'm investigating, the navigator.onLine property is FALSE (which also happens to be the case with PhantomJS). Regular browsers I tested all returned TRUE for onLine. - Lilongwe
Also, navigator.plugins is empty in my case, which is also consistent with PhantomJS. - Lilongwe
@Lilongwe Thanks. navigator.plugins can only be considered, carefully, as an additional combined signal, because it's also empty in many mobile environments and eventually will be in some desktop browsers too. - Vandalize
navigator.plugins is indeed weak evidence. In our tests, IE11 and Android Chrome did not report any plugins, while desktop Chrome and Firefox did. This can be used to our advantage, however, because we know which browsers should report plugins. For instance, if the browser claims to be desktop Chrome, but fails to provide the list of plugins, we can tell it's forging the User-Agent, which is already suspicious. - Lilongwe
In any case, I think that any reasonably future-proof detection method should go after properties that cannot be easily forged. Changing JS-exposed object names, providing a convincing pair of window.outer* properties, setting navigator.onLine to TRUE or even providing a list of plugins should be dead easy for someone with malicious intent. I can think of a few ways, like detecting keyboard/mouse/touch interaction (can be defeated relatively easily) or fingerprinting browser capabilities (should be hard to defeat). - Lilongwe
I was thinking along the lines of compiling a list of known User-Agent patterns that are specific to known desktop browsers. This assumes that mobile browsers use a User-Agent string that is distinguishable from their desktop counterparts, of course. Even then, I'd only propose this metric to be used in a heuristic model of multiple properties. - Lilongwe
@Lilongwe Nope, they are not distinguishable. And I already have a JS script in production fully detecting spoofed 'desktop mode' agents on mobile. But that subject is outside the scope of this question. - Vandalize
Thanks for the update, that indeed renders the plugins idea useless in practice. - Lilongwe
3

There is no rock-solid way: PhantomJS, and Selenium, are just software being used to control browser software, instead of a user controlling it.

With PhantomJS 1.x, in particular, I believe there is some JavaScript you can use to crash the browser that exploits a bug in the version of WebKit being used (it is equivalent to Chrome 13, so very few genuine users should be affected). (I remember this being mentioned on the Phantom mailing list a few months back, but I don't know if the exact JS to use was described.) More generally you could use a combination of user-agent matching up with feature detection. E.g. if a browser claims to be "Chrome 23" but does not have a feature that Chrome 23 has (and that Chrome 13 did not have), then get suspicious.
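The user-agent versus feature check can be sketched as below. It is an illustration under assumptions: the helper names are made up, and the single-entry feature table is a placeholder that would have to be filled in from real compatibility data. `Function.prototype.bind` is used as the example because PhantomJS 1.x is known to lack it, while any browser genuinely reporting Chrome 23 has it.

```javascript
// Placeholder table: features that a claimed Chrome major version must have.
var EXPECTED_FEATURES = [
  {
    name: 'Function.prototype.bind',
    minChrome: 23, // conservative threshold; bind shipped in Chrome well before 23
    probe: function (win) {
      return typeof win.Function.prototype.bind === 'function';
    }
  }
];

// Extracts the major version from a "Chrome/NN" token, or null if absent.
function claimedChromeMajor(userAgent) {
  var match = /Chrome\/(\d+)/.exec(userAgent);
  return match ? parseInt(match[1], 10) : null;
}

// Returns the names of features the claimed version should have but lacks.
function missingFeatures(userAgent, win) {
  var missing = [];
  var major = claimedChromeMajor(userAgent);
  if (major === null) {
    return missing; // not claiming to be Chrome, nothing to compare
  }
  for (var i = 0; i < EXPECTED_FEATURES.length; i++) {
    var feature = EXPECTED_FEATURES[i];
    if (major >= feature.minChrome && !feature.probe(win)) {
      missing.push(feature.name);
    }
  }
  return missing;
}
```

In a page this would be called as `missingFeatures(navigator.userAgent, window)`. A non-empty result only says the claimed version and the observed features disagree, which is a reason to get suspicious and not a verdict.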

As a user, I hate CAPTCHAs too. But they are quite effective in that they increase the cost for the spammer: he has to write more software or hire humans to read them. (That is why I think easy CAPTCHAs are good enough: the ones that annoy users are those where you have no idea what it says and have to keep pressing reload to get something you recognize.)

One approach (which I believe Google uses) is to show the CAPTCHA conditionally. E.g. users who are logged-in never get shown it. Users who have already done one post this session are not shown it again. Users from IP addresses in a whitelist (which could be built from previous legitimate posts) are not shown them. Or conversely just show them to users from a blacklist of IP ranges.
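The conditional rules above could be combined along these lines. This is a server-side sketch with made-up field names (`loggedIn`, `postsThisSession`, `ip`, `challengeUnknown`) and plain arrays of exact IP strings standing in for whatever whitelist and blacklist storage is actually used; matching IP ranges would need extra code.

```javascript
// Hypothetical policy: decides whether to show a CAPTCHA for one request.
function shouldShowCaptcha(request, lists) {
  // Logged-in users are never challenged.
  if (request.loggedIn) return false;
  // Users who already posted this session are not challenged again.
  if (request.postsThisSession > 0) return false;
  // Addresses seen on previous legitimate posts are trusted.
  if (lists.whitelist.indexOf(request.ip) !== -1) return false;
  // Addresses on the blacklist are always challenged.
  if (lists.blacklist.indexOf(request.ip) !== -1) return true;
  // Everyone else follows the configured default.
  return lists.challengeUnknown;
}
```

Setting `challengeUnknown` to false gives the "only challenge the blacklist" variant, while true challenges every visitor who has not yet earned trust.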

I know none of those approaches are perfect, sorry.

Semifluid answered 1/1, 2014 at 23:57 Comment(0)
3

You could detect PhantomJS on the client side by checking the window.callPhantom property. The minimal client-side script is:

var isPhantom = !!window.callPhantom;

Here is a gist with proof of concept that this works.

A spammer could try to delete this property with page.evaluate, and then it depends on who is faster. After running the detection, you reload the page with the post form, with or without a CAPTCHA depending on the detection result.

The problem is that you incur a redirect that might annoy your users. This will be necessary with every client-side detection technique, each of which can be subverted and changed with onResourceRequested.

Generally, I don't think that this is possible, because you can only detect on the client and send the result to the server. Adding the CAPTCHA combined with the detection step with only one page load does not really add anything as it could be removed just as easily with phantomjs/casperjs. Defense based on user agent also doesn't make sense since it can be easily changed in phantomjs/casperjs.

Semiconductor answered 28/6, 2014 at 19:43 Comment(2)
page is not exposed to the window. That's an internal phantomjs property requesting the actual page. The question is to detect phantom from the js window scope when the page is executed with phantomjs. - Vandalize
It was just to illustrate the working detector. I'll reduce the code if it seems to confuse. - Semiconductor
