Detecting 'stealth' web-crawlers

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as Googlebot and Yahoo! Slurp. I consider a bot nice if it:

  1. identifies itself as a bot in the user agent string
  2. reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed (updated list, thanks Chris, gs); a minimal sketch of the first trap follows the list:

  1. Adding a directory only listed (marked as disallow) in the robots.txt,
  2. Adding invisible links (possibly marked as rel="nofollow"?),
    • style="display: none;" on link or parent container
    • placed underneath another element with higher z-index
  3. detect who doesn't understand CaPiTaLiSaTioN,
  4. detect who tries to post replies but always fails the Captcha.
  5. detect GET requests to POST-only resources
  6. detect interval between requests
  7. detect order of pages requested
  8. detect who (consistently) requests HTTPS resources over HTTP
  9. detect who does not request image files (this, in combination with a list of user agents of known image-capable browsers, works surprisingly well)
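
As an illustration of the first trap, here is a minimal Python sketch: a directory that only appears in robots.txt (as Disallow) is linked from nowhere a human would look, so any client requesting it either ignored robots.txt or followed a honeypot link. The directory name and the in-memory storage are made up for illustration.

from datetime import datetime, timezone

# robots.txt would contain:
#   User-agent: *
#   Disallow: /no-crawl-zone/
# The directory name is hypothetical and should appear nowhere a human would click.
HONEYPOT_PREFIX = "/no-crawl-zone/"
flagged_ips = {}                      # ip -> time it first tripped the trap

def check_honeypot(path: str, client_ip: str) -> bool:
    """Return True if this request tripped the honeypot directory."""
    if path.startswith(HONEYPOT_PREFIX):
        flagged_ips.setdefault(client_ip, datetime.now(timezone.utc))
        return True
    return False

# check_honeypot("/no-crawl-zone/index.html", "203.0.113.7")  -> True, suspect
# check_honeypot("/products/index.html", "203.0.113.7")       -> False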

Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist (a sketch of this combination follows the list):

  1. It triggers a trap
  2. It requests robots.txt
  3. It does not trigger another trap because it obeyed robots.txt
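
A minimal Python sketch of that whitelist combination, assuming you already track per client whether it fetched robots.txt and which traps it tripped; the record fields and the classification rule are made up for illustration.

from dataclasses import dataclass, field

@dataclass
class ClientRecord:
    fetched_robots_txt: bool = False
    trap_hits: list = field(default_factory=list)   # names of traps tripped

def classify(client: ClientRecord) -> str:
    if not client.trap_hits:
        return "probably human"
    if client.fetched_robots_txt and len(client.trap_hits) == 1:
        # Tripped one trap, read robots.txt, and stayed out of the others.
        return "probably a good bot"
    return "probably a bad bot"

# classify(ClientRecord(True, ["hidden-link"]))                    -> good bot
# classify(ClientRecord(False, ["hidden-link", "lowercase-url"]))  -> bad bot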

One other important thing here: please consider blind people using screen readers; give people a way to contact you, or a (non-image) Captcha to solve, so they can continue browsing.

What methods are there to automatically detect web crawlers trying to mask themselves as normal human visitors?

The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler.

Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc. I have no illusions: I won't be able to beat them.

You would however be surprised how stupid some crawlers are. The best example of stupidity, in my opinion: casting all URLs to lower case before requesting them.

And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.

Thieve answered 24/10, 2008 at 11:47 Comment(0)
17

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were updated automatically as needed by checking claimed user-agent strings; if a client claimed to be a legitimate spider but was not on the whitelist, the system performed DNS/reverse-DNS lookups to verify that the source IP address corresponded to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to black/whitelist the address in case of an incorrect assessment.
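
The DNS/reverse-DNS verification described above could look roughly like this in Python: a minimal sketch that forward-confirms the reverse lookup, with an illustrative (not exhaustive) suffix table.

import socket

# Illustrative, not exhaustive: claimed bot name -> acceptable rDNS host suffixes.
CLAIMED_BOT_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_claimed_bot(ip: str, user_agent: str) -> bool:
    """Forward-confirmed reverse DNS check for clients claiming to be a known spider."""
    ua = user_agent.lower()
    for bot, suffixes in CLAIMED_BOT_DOMAINS.items():
        if bot in ua:
            try:
                host = socket.gethostbyaddr(ip)[0]                # reverse lookup
                forward_ips = socket.gethostbyname_ex(host)[2]    # forward-confirm
            except (socket.herror, socket.gaierror):
                return False
            return host.endswith(suffixes) and ip in forward_ips
    return True   # did not claim to be a known spider; nothing to verify here

# verify_claimed_bot("66.249.66.1", "Mozilla/5.0 (compatible; Googlebot/2.1)")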

I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.

Side point: If you're thinking about doing a similar detection system based on hit-rate-limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. I see a lot of people talking about these kinds of schemes who want to block anyone who tops 5-10 hits in a second, which may generate false positives on image-heavy pages (unless images are excluded from the tally) and will generate false positives when someone like me finds an interesting site that he wants to read all of, so he opens up all the links in tabs to load in the background while he reads the first one.
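
For the rate-limiting part, here is a minimal sliding-window sketch in Python; the window size and hit limit are illustrative numbers, and image requests should be excluded from the tally as noted above.

import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 300     # five-minute totals, as recommended above
HIT_LIMIT = 600          # illustrative ceiling for page hits per IP per window

page_hits = defaultdict(deque)    # ip -> timestamps of page (not image) hits

def record_hit(ip: str, now: Optional[float] = None) -> bool:
    """Record a page hit and return True if this IP went over the limit."""
    now = time.time() if now is None else now
    q = page_hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()               # drop hits that fell out of the window
    return len(q) > HIT_LIMIT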

Afrikaner answered 22/11, 2008 at 18:38 Comment(1)
I find that false positives from blocking web crawlers absolutely kill web traffic. You are basically pissing off 99.8% of your users in a poor attempt to hinder crawlers that can easily bypass the naive method described. Never a good idea to deny or hinder user access, because it destroys the user experience with your site.Lapotin
15

See Project Honeypot - they're setting up bot traps on a large scale (and have a DNSRBL with their IPs).

Use tricky URLs and HTML:

<a href="//example.com/"> = http://example.com/ on http pages.
<a href="page&amp;&#x23;hash"> = page& + #hash

In HTML you can use plenty of tricks with comments, CDATA elements, entities, etc:

<a href="foo<!--bar-->"> (comment should not be removed)
<script>var haha = '<a href="bot">'</script>
<script>// <!-- </script> <!--><a href="bot"> <!-->
Slype answered 21/11, 2008 at 21:56 Comment(0)
10

An easy solution is to create a link and make it invisible

<a href="iamabot.script" style="display:none;">Don't click me!</a>

Of course you should expect that some people who look at the source code follow that link just to see where it leads. But you could present those users with a captcha...

Valid crawlers would, of course, also follow the link. But rather than adding rel=nofollow, look for the signs of a valid crawler (like the user agent); a sketch follows.
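
A minimal Python sketch of that server-side handling, assuming the hidden link points at something like iamabot.script: known crawlers (matched by user agent, ideally backed by a reverse-DNS check) are ignored, everyone else is flagged and could be shown a captcha. The whitelist pattern and the storage are illustrative.

import re

# Whitelist of crawlers allowed to follow the hidden link; user agents can be
# faked, so pair this with a reverse-DNS check before trusting it.
KNOWN_CRAWLERS = re.compile(r"googlebot|bingbot|slurp", re.IGNORECASE)
suspected_bots = set()

def handle_honeypot_hit(client_ip: str, user_agent: str) -> str:
    """Decide what to do when someone requests the hidden link's target."""
    if KNOWN_CRAWLERS.search(user_agent):
        return "ignore"            # a legitimate crawler following every link
    suspected_bots.add(client_ip)
    return "serve_captcha"         # curious humans who read the source can pass it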

Archeology answered 24/10, 2008 at 13:32 Comment(4)
Unless the bot checks the CSS attributes of the link and doesn't follow the link because it's not visible to a human user...Iulus
Labelling the link "DO NOT click me" would be a better idea.. If someone has CSS disabled (or no CSS support), the link will be visible..Allegro
Good idea. Perhaps change the text to "." and the css style to match the background - making it invisible to most users? Or, run a script to hide it after 1 second leaving it only visible to a bot who can't link the javascript hide command to the link?Phallic
Beware of black hat penalty from SEO perspective.Exert
7

One thing you didn't list, which is commonly used to detect bad crawlers:

Hit speed: good web crawlers will break their hits up so they don't deluge a site with requests. Bad ones will do one of three things (a sketch for detecting the third follows the list):

  1. hit sequential links one after the other
  2. hit sequential links in parallel (2 or more at a time)
  3. hit sequential links at a fixed interval
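
A minimal Python sketch for the third pattern: flag clients whose inter-request gaps are nearly identical. The variance threshold and minimum hit count are illustrative guesses.

from statistics import pstdev

def looks_like_fixed_interval(timestamps: list, min_hits: int = 10) -> bool:
    """Flag a client whose requests arrive at an almost constant interval."""
    if len(timestamps) < min_hits:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return pstdev(gaps) < 0.5      # nearly identical gaps -> likely a bot

# Hits every 2.0 +/- 0.01 seconds would be flagged; human-like gaps of
# 1, 7, 3, 42 and 5 seconds would not.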

Also, some offline browsing programs will slurp up a number of pages; I'm not sure what kind of threshold you'd want to use to start blocking by IP address.

This method will also catch mirroring programs like fmirror or wget.

If the bot randomizes the time interval, you could check to see if the links are traversed in a sequential or depth-first manner, or you can see if the bot is traversing a huge amount of text (as in words to read) in a too-short period of time. Some sites limit the number of requests per hour, also.

Actually, I heard an idea somewhere, I don't remember where, that if a user gets too much data, in terms of kilobytes, they can be presented with a captcha asking them to prove they aren't a bot. I've never seen that implemented though.

Update on Hiding Links

As far as hiding links goes, you can put a div under another with CSS (placing it first in the draw order) and possibly set the z-order. A bot could not ignore that without parsing all your JavaScript to see if it is a menu. To some extent, links inside invisible DIV elements also can't be ignored without the bot parsing all the JavaScript.

Taking that idea to completion, uncalled JavaScript which could potentially show the hidden elements would possibly fool a subset of JavaScript-parsing bots. And it is not a lot of work to implement.

Home answered 24/10, 2008 at 13:8 Comment(3)
Major flaw with "ignoring JavaScript means you're a bot" methods: Some of us use the NoScript plugin. No site runs JavaScript on me unless I whitelist the site and I'm pretty sure I'm not a bot.Afrikaner
bots can execute Javascript now...it's 2013 for christ sakes. so there goes the whole argument. who says web crawlers visits sites in sequential selections? another huge assumption.Lapotin
The javascript was only for the showing of a honeypot link. The idea is that the bots will parse the javascript that will make a honeypot link visible, making them more likely to follow the link. However for a real user, the code that makes the link visible would never be executed. Thus NoScript users, along with anyone that doesn't go randomly executing functions would be fine. That said, I'm not sure why/how a bot would randomly be executing code, and if it was doing a static analysis to determine if an element might become visible, that would be one fancy bot.Emory
4

It's not actually that easy to keep up with the good user-agent strings. Browser versions come and go. Building statistics on user-agent strings, grouped by behavior, can reveal interesting things.

I don't know how far this could be automated, but at least it is one differentiating thing.

Sabadell answered 24/10, 2008 at 14:58 Comment(0)
4

One simple bot-detection method I've heard of for forms is the hidden-input technique. If you are trying to secure a form, put an input in the form with an id that looks completely legit. Then use CSS in an external file to hide it. Or, if you are really paranoid, set up something like jQuery to hide the input box on page load. If you do this right, I imagine it would be very hard for a bot to figure out. You know those bots have it in their nature to fill out everything on a page, especially if you give your hidden input an id of something like id="fname", etc.
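
A minimal Python sketch of the server-side check for this technique, using the answer's id="fname" as the honeypot field; the form dict stands in for whatever POST data your framework hands you.

HONEYPOT_FIELD = "fname"    # looks legit, but is hidden with CSS / on page load

def submission_is_bot(form: dict) -> bool:
    """Return True if the hidden honeypot field came back non-empty."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())

# submission_is_bot({"email": "a@b.c", "fname": "Robot Robotson"})  -> True
# submission_is_bot({"email": "a@b.c", "fname": ""})                -> False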

Sweater answered 7/4, 2009 at 20:4 Comment(1)
not if the bots are able to wait for the jquery to finish, just like a regular browser can. This would've worked well in the early 00sLapotin
3

People keep addressing broad crawlers but not crawlers that are specialized for your website.

I write stealth crawlers, and if they are individually built, no amount of honeypots or hidden links will have any effect whatsoever - the only real way to detect specialised crawlers is by inspecting connection patterns.

The best systems (e.g. LinkedIn's) use AI to address this.
The easiest solution is to write log parsers that analyze IP connection patterns and simply blacklist those IPs or serve them a captcha, at least temporarily.

e.g.
if IP X is seen every 2 seconds connecting to foo.com/cars/*.html but not any other pages - it's most likely a bot or a hungry power user.
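
A minimal Python sketch of that kind of log analysis: flag an IP whose requests all fall under one narrow URL pattern (here /cars/...), which is typical of a crawler built for one section of a site. The pattern and the threshold are illustrative.

import re
from collections import defaultdict

SECTION_PATTERN = re.compile(r"^/cars/.+\.html$")   # illustrative target section
paths_by_ip = defaultdict(list)                     # ip -> requested paths

def record(ip: str, path: str) -> None:
    paths_by_ip[ip].append(path)

def is_section_scraper(ip: str, min_requests: int = 50) -> bool:
    """Flag IPs that only ever request the one section, never anything else."""
    paths = paths_by_ip[ip]
    return len(paths) >= min_requests and all(SECTION_PATTERN.match(p) for p in paths)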

Alternatively, there are various JavaScript challenges that act as protection (e.g. Cloudflare's anti-bot system), but those are easily solvable; you can write something custom, and that might be enough of a deterrent to make it not worth the effort for the crawler.

However, you must ask yourself whether you are willing to give legitimate users false positives and introduce inconvenience for them in order to prevent bot traffic. Protecting public data is an impossible paradox.

Adjutant answered 24/10, 2008 at 11:47 Comment(0)
3

Untested, but here is a nice list of user agents you could make a regular expression out of (a usage sketch follows the list). It could get you most of the way there:

ADSARobot|ah-ha|almaden|aktuelles|Anarchie|amzn_assoc|ASPSeek|ASSORT|ATHENS|Atomz|attach|attache|autoemailspider|BackWeb|Bandit|BatchFTP|bdfetch|big.brother|BlackWidow|bmclient|Boston\ Project|BravoBrian\ SpiderEngine\ MarcoPolo|Bot\ mailto:[email protected]|Buddy|Bullseye|bumblebee|capture|CherryPicker|ChinaClaw|CICC|clipping|Collector|Copier|Crescent|Crescent\ Internet\ ToolPak|Custo|cyberalert|DA$|Deweb|diagem|Digger|Digimarc|DIIbot|DISCo|DISCo\ Pump|DISCoFinder|Download\ Demon|Download\ Wonder|Downloader|Drip|DSurf15a|DTS.Agent|EasyDL|eCatch|ecollector|efp@gmx\.net|Email\ Extractor|EirGrabber|email|EmailCollector|EmailSiphon|EmailWolf|Express\ WebPictures|ExtractorPro|EyeNetIE|FavOrg|fastlwspider|Favorites\ Sweeper|Fetch|FEZhead|FileHound|FlashGet\ WebWasher|FlickBot|fluffy|FrontPage|GalaxyBot|Generic|Getleft|GetRight|GetSmart|GetWeb!|GetWebPage|gigabaz|Girafabot|Go\!Zilla|Go!Zilla|Go-Ahead-Got-It|GornKer|gotit|Grabber|GrabNet|Grafula|Green\ Research|grub-client|Harvest|hhjhj@yahoo|hloader|HMView|HomePageSearch|http\ generic|HTTrack|httpdown|httrack|ia_archiver|IBM_Planetwide|Image\ Stripper|Image\ Sucker|imagefetch|IncyWincy|Indy*Library|Indy\ Library|informant|Ingelin|InterGET|Internet\ Ninja|InternetLinkagent|Internet\ Ninja|InternetSeer\.com|Iria|Irvine|JBH*agent|JetCar|JOC|JOC\ Web\ Spider|JustView|KWebGet|Lachesis|larbin|LeechFTP|LexiBot|lftp|libwww|likse|Link|Link*Sleuth|LINKS\ ARoMATIZED|LinkWalker|LWP|lwp-trivial|Mag-Net|Magnet|Mac\ Finder|Mag-Net|Mass\ Downloader|MCspider|Memo|Microsoft.URL|MIDown\ tool|Mirror|Missigua\ Locator|Mister\ PiX|MMMtoCrawl\/UrlDispatcherLLL|^Mozilla$|Mozilla.*Indy|Mozilla.*NEWT|Mozilla*MSIECrawler|MS\ FrontPage*|MSFrontPage|MSIECrawler|MSProxy|multithreaddb|nationaldirectory|Navroad|NearSite|NetAnts|NetCarta|NetMechanic|netprospector|NetResearchServer|NetSpider|Net\ Vampire|NetZIP|NetZip\ Downloader|NetZippy|NEWT|NICErsPRO|Ninja|NPBot|Octopus|Offline\ Explorer|Offline\ Navigator|OpaL|Openfind|OpenTextSiteCrawler|OrangeBot|PageGrabber|Papa\ Foto|PackRat|pavuk|pcBrowser|PersonaPilot|Ping|PingALink|Pockey|Proxy|psbot|PSurf|puf|Pump|PushSite|QRVA|RealDownload|Reaper|Recorder|ReGet|replacer|RepoMonkey|Robozilla|Rover|RPT-HTTPClient|Rsync|Scooter|SearchExpress|searchhippo|searchterms\.it|Second\ Street\ Research|Seeker|Shai|Siphon|sitecheck|sitecheck.internetseer.com|SiteSnagger|SlySearch|SmartDownload|snagger|Snake|SpaceBison|Spegla|SpiderBot|sproose|SqWorm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|SurfWalker|Szukacz|tAkeOut|tarspider|Teleport\ Pro|Templeton|TrueRobot|TV33_Mercator|UIowaCrawler|UtilMind|URLSpiderPro|URL_Spider_Pro|Vacuum|vagabondo|vayala|visibilitygap|VoidEYE|vspider|Web\ Downloader|w3mir|Web\ Data\ Extractor|Web\ Image\ Collector|Web\ Sucker|Wweb|WebAuto|WebBandit|web\.by\.mail|Webclipping|webcollage|webcollector|WebCopier|webcraft@bea|webdevil|webdownloader|Webdup|WebEMailExtrac|WebFetch|WebGo\ IS|WebHook|Webinator|WebLeacher|WEBMASTERS|WebMiner|WebMirror|webmole|WebReaper|WebSauger|Website|Website\ eXtractor|Website\ Quester|WebSnake|Webster|WebStripper|websucker|webvac|webwalk|webweasel|WebWhacker|WebZIP|Wget|Whacker|whizbang|WhosTalking|Widow|WISEbot|WWWOFFLE|x-Tractor|^Xaldon\ WebSpider|WUMPUS|Xenu|XGET|Zeus.*Webster|Zeus [NC]

Taken from: http://perishablepress.com/press/2007/10/15/ultimate-htaccess-blacklist-2-compressed-version/
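
A minimal Python sketch of how such a list could be applied, using only a short, hand-picked slice of the entries above; in practice you would paste in the full list.

import re

# A tiny, hand-picked subset of the blacklist above, joined into one pattern.
UA_BLACKLIST = re.compile(
    r"HTTrack|WebZIP|Wget|EmailSiphon|BlackWidow|Teleport Pro|larbin|webcollage",
    re.IGNORECASE,
)

def is_blacklisted(user_agent: str) -> bool:
    return bool(UA_BLACKLIST.search(user_agent))

# is_blacklisted("Wget/1.21.4")                    -> True
# is_blacklisted("Mozilla/5.0 (Windows NT 10.0)")  -> False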

Sarette answered 24/10, 2008 at 11:47 Comment(0)
1

You can also check referrers. No referrer could raise bot suspicion. A bad referrer certainly means it is not a browser.
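
A minimal Python sketch of that referrer check; the interpretation of each case is illustrative, since legitimate browsers and privacy tools also suppress referrers.

from typing import Optional
from urllib.parse import urlparse

def referrer_signal(referer: Optional[str], is_entry_page: bool) -> str:
    """Very rough scoring of the Referer header for one request."""
    if not referer:
        # Deep pages reached with no referrer at all are only a mild signal.
        return "ok" if is_entry_page else "suspicious"
    parsed = urlparse(referer)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "very suspicious"   # malformed referrer: almost certainly not a browser
    return "ok"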

Adding invisible links (possibly marked as rel="nofollow"?),

* style="display: none;" on link or parent container
* placed underneath another element with higher z-index

I wouldn't do that. You can end up blacklisted by Google for black-hat SEO :)

Datestamp answered 24/10, 2008 at 11:47 Comment(1)
What exactly and why would it get you black listed?Smothers
1

I currently work for a company that scans web sites in order to classify them. We also check sites for malware.

In my experience, the number one blockers of our web crawler (which of course uses an IE or Firefox UA and does not obey robots.txt. Duh.) are sites intentionally hosting malware. It's a pain, because the site then falls back to a human who has to manually load the site, classify it, and check it for malware.

I'm just saying, by blocking web crawlers you're putting yourself in some bad company.

Of course, if they are horribly rude and suck up tons of your bandwidth it's a different story because then you've got a good reason.

Polydactyl answered 24/10, 2008 at 11:47 Comment(6)
I'm sorry, but if you run a crawler that does not obey robots.txt, you are not obeying the rules. By not obeying the rules, you yourself are putting yourself in some really bad company. By suggesting that enforcing the rules as set by the owner of the website (in robots.txt) is bad practice, you are wrongfully flipping the issue upside down. You basically state that you do not understand who the rightful owner of content is.Thieve
@Jacco: If a crawler looking for malware obeyed the rules, it would never find any. Go talk to the malware authors.Polydactyl
@Jacco: Percentage of legit sites that try to block non-compliant crawlers? Under 1%. Malware sites that try? Over 60%. So yeah, it is suspicious.Polydactyl
@Thieve actually no, there are no owners of a content on the web if it's public. Someone who is doing this without copying and pasting manually should be given an award not punishment. This whole copyright concept needs to be abolished on the internet. Only creative innovation and trust can build value and worthy of people's attention, not by some threat of opaque legal veil.Lapotin
All this outlier indicates is that the person running the site put a lot of time and thought into the site, and they have some degree of technical skill. So of course that rules out most of the bell curve.Toscanini
@jacco robots.txt is only for search engines and not for malware detection crawlers, If it's publicly accessible webpage, any crawler can visit it, it's just ethics that search engines follow if the owner does not want a particular page to be indexed, it can (and is) crawled, irrespective of robots.txtIntervalometer
-1

short answer: if a mid-level programmer knows what he's doing, you won't be able to detect a crawler without affecting the real user. Since your information is public, you won't be able to defend it against a crawler... it's like a 1st Amendment right :)

Fairish answered 24/10, 2008 at 11:47 Comment(0)
