Advice for use of honeypot img tag to detect scrapers / bad bots

Asked 7/9, 2011 at 20:24 Answered 7/9, 2011 at 21:00

Solved html image detect scraper honeypot

We want to set up a little honeypot image in our HTML bodies to detect scrapers / bad bots.

Has anyone set something like this up before?

We were thinking the best way to go at it would be to:

a) Comment the html out via:

<!-- <img src="http://www.domain.com/honeypot.gif"/> -->

b) Apply css styles to the image that would make it hidden from browsers via:

.... id="honeypot" ....

#honeypot{
    display:none;
    visibility:hidden;
}

Using the above, does anyone foresee any situation where a legitimate user agent would pull the image or attempt to render it?

The honeypot.gif URL would be rewritten via mod_rewrite to a PHP script, which is where we would do our logging.
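To make the logging idea concrete, here is a rough sketch of what such a handler could record. It is written in Python rather than the PHP script mentioned above, and the function name, log fields, and the example IP are assumptions made for illustration, not part of our actual setup. It returns a valid 1x1 GIF so that any real browser that does fetch the URL still gets an image.

```python
import base64
import json
import time

# A widely used transparent 1x1 GIF, served so the response is a valid image.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_honeypot_hit(remote_ip, headers, log_lines, now=None):
    """Record one request for the honeypot URL.

    Hypothetical helper: `headers` is a plain dict of request headers and
    `log_lines` is any list-like sink. Returns (status, headers, body).
    """
    entry = {
        "time": time.time() if now is None else now,
        "ip": remote_ip,
        "user_agent": headers.get("User-Agent", ""),
        "referer": headers.get("Referer", ""),
        "has_accept": "Accept" in headers,
    }
    log_lines.append(json.dumps(entry, sort_keys=True))

    response_headers = {
        "Content-Type": "image/gif",
        "Content-Length": str(len(PIXEL_GIF)),
        # Discourage caching so repeat fetches show up in the log.
        "Cache-Control": "no-store",
    }
    return 200, response_headers, PIXEL_GIF
```

A hit recorded this way is only a signal to weigh alongside the other checks, since a real browser might still request the image.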

While I understand that any well-coded scraper might sidestep both of the above, it would at least give us some insight into the very dirty ones.

Any other pointers as to the best way to go at this?

Robynroc asked 7/9, 2011 at 20:24 Comment(4)

What is your definition of "bad bots"? What kinds of things are you trying to prevent? A bot that behaves poorly in fetching your pages might not fall victim to an HTML-parsing trick like this, so you might not catch it. There might be easier ways to detect what you are looking for. – Grizzle 7/9, 2011 at 20:33

I don't understand how this is a honeypot implementation. Usually it involves a form field which is hidden from the user via script/css that bots unknowingly fill. – Gooding 7/9, 2011 at 20:39

While it may sound overly broad, our definition of a bad bot / scraper is one that does not identify the source product (read: domain.com) via the useragent, OR whose domain.com does not provide a way to ban access via robots.txt. We see a lot of these. We already have a fairly comprehensive system that detects them via useragent / lack thereof, lack of an Accept header, hits / interval, etc. So this would be a further addition to that system, giving us an extra + on which IPs to focus manual manpower on. – Robynroc 7/9, 2011 at 20:41

@Gooding We want to know if someone pulls a document body and then tries to pull all the images contained in the HTML. If the document body contains an image that is never visible to, or requested by, a real user, it should give us a little insight into bots / scrapers that pull document bodies plus all images in that HTML. – Robynroc 7/9, 2011 at 20:44

A bot will ignore your img tag because it's within a comment.

Instead, you might consider creating an invisible div which contains a link to a trigger URL on the same site (preferably within the same directory, in case the bot is depth sensitive).
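As a rough illustration of the hidden-link idea, the Python sketch below builds the trap markup so that the trigger URL sits in the same directory as the page being served. The file name trap.html, the helper names, and the link text are all made up for the example.

```python
import posixpath
from html import escape

TRAP_NAME = "trap.html"  # hypothetical trigger file name


def trap_url_for(page_path):
    """Return a trigger URL in the same directory as the given page path."""
    directory = posixpath.dirname(page_path)
    return posixpath.join(directory, TRAP_NAME)


def trap_markup(page_path):
    """Return an invisible div containing a link to the trigger URL."""
    href = escape(trap_url_for(page_path), quote=True)
    return (
        '<div style="display:none">'
        f'<a href="{href}" rel="nofollow">archive</a>'
        "</div>"
    )
```

For example, `trap_markup("/articles/2011/post.html")` links to `/articles/2011/trap.html`. Listing the trigger URL as disallowed in robots.txt would presumably keep well-behaved crawlers away from it, so that mostly the bots you care about end up requesting it.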

Cattery answered 7/9, 2011 at 20:42 Comment(1)

we ended up doing both, the hidden image and hidden link. thanks! – Robynroc 13/9, 2011 at 8:45

IMO any good scraper is going to know how to parse HTML using an SGML parser, and would just skip the commented-out image, but I could be wrong.
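To show the difference, here is a small Python sketch comparing a scraper that uses a real parser (the standard library's html.parser) with one that greps for img tags using a regex. The sample page, /honeypot2.gif, and /logo.png are invented for the example; only the commented-out tag comes from the question.

```python
import re
from html.parser import HTMLParser

# Sample page: one commented-out image, one CSS-hidden image, one normal image.
PAGE = (
    "<body>"
    '<!-- <img src="http://www.domain.com/honeypot.gif"/> -->'
    '<img id="honeypot" src="/honeypot2.gif"/>'
    '<img src="/logo.png">'
    "</body>"
)


class ImgCollector(HTMLParser):
    """Collect src attributes of img tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Comments are routed to handle_comment, so tags inside them never get here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources_parsed(html):
    """Image URLs as a parser-based scraper would see them."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def img_sources_regex(html):
    """Image URLs as a naive regex-based scraper would see them."""
    return re.findall(r'<img[^>]*\bsrc="([^"]*)"', html)
```

On this sample, the parser-based version returns the two real img tags and never sees the commented one, while the regex version also picks up honeypot.gif. So the commented image would only catch the crude, pattern-matching kind of scraper.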

At most it will tell you when scraping happens, but it doesn't provide a way to counter a scraper. You would probably be better off coming up with some kind of cookie-based solution, as most bots probably don't handle cookies. You could also randomize image paths between requests and expire them after a short period.
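One possible way to do the randomized, expiring image paths is to sign a timestamp and a random nonce into the file name, so the server can validate a path without storing it. This is only a sketch: the /img/ prefix, the key, the five-minute lifetime, and the function names are all assumptions.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = b"replace-with-a-private-random-key"  # hypothetical key
TTL_SECONDS = 300  # hypothetical lifetime of a generated path


def _sign(payload):
    """Return a truncated HMAC-SHA256 hex signature for the payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()[:32]


def make_image_path(now=None):
    """Build a fresh image path that differs on every call."""
    issued = int(time.time() if now is None else now)
    payload = f"{issued}-{secrets.token_hex(8)}"
    return f"/img/{payload}-{_sign(payload)}.gif"


def check_image_path(path, now=None):
    """Classify a requested path as 'valid', 'expired', or 'forged'."""
    current = time.time() if now is None else now
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".gif"):
        return "forged"
    parts = name[:-4].split("-")
    if len(parts) != 3:
        return "forged"
    issued_text, nonce, signature = parts
    expected = _sign(f"{issued_text}-{nonce}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return "forged"
    # The signature matched, so issued_text is the integer we generated.
    if current - int(issued_text) > TTL_SECONDS:
        return "expired"
    return "valid"
```

A request for an expired path might suggest a client replaying a page it saved earlier, and a forged one a client guessing URLs, though either could also have innocent causes such as a tab left open for a long time.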

Checking the referrer is an obvious one, if you don't care about browsers that don't send it or people who hide/alter it.
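A referrer check could look something like the Python sketch below, which sorts a Referer header value into three buckets. The set of site hosts and the function name are assumptions; a missing referrer is kept separate from a foreign one precisely because of the caveat above.

```python
from urllib.parse import urlsplit

SITE_HOSTS = {"www.domain.com", "domain.com"}  # assumed host names for the site


def classify_referer(referer):
    """Return 'missing', 'same-site', or 'other' for a Referer header value."""
    if not referer:
        return "missing"
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        # urlsplit rejects some malformed values, e.g. a broken IPv6 literal.
        return "other"
    if host is not None and host.lower() in SITE_HOSTS:
        return "same-site"
    return "other"
```

An image request whose referrer is "other" is probably the more interesting signal; "missing" on its own says little.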

Gooding answered 7/9, 2011 at 21:00 Comment(0)
