Advice for use of honeypot img tag to detect scrapers / bad bots

We want to set up a small honeypot image in our HTML bodies to detect scrapers / bad bots.

Has anyone set something like this up before?

We were thinking the best way to go at it would be to:

a) Comment the html out via:

<!-- <img src="http://www.domain.com/honeypot.gif"/> -->

b) Apply css styles to the image that would make it hidden from browsers via:

.... id="honeypot" ....

#honeypot{
    display:none;
    visibility:hidden;
}

Using the above, does anyone foresee any situation where a legitimate user agent would fetch the image or attempt to render it?

The honeypot.gif URL would be rewritten via mod_rewrite to a PHP script where we would do our logging.
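To make the logging idea concrete, here is a minimal sketch of such an endpoint. It is written in Python as a stand-in for the PHP script mentioned above, so everything in it is an assumption rather than our actual setup: the `honeypot_app` name, the logger name, and the fields logged are all hypothetical. It records who asked for the image and still returns a valid 1x1 GIF, so the response looks unremarkable to the client.

```python
import base64
import logging

log = logging.getLogger("honeypot")  # hypothetical logger name

# A transparent 1x1 GIF, so the client still receives a valid image.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def honeypot_app(environ, start_response):
    """WSGI app: log the request details, then serve the pixel."""
    log.warning(
        "honeypot hit ip=%s ua=%r referer=%r",
        environ.get("REMOTE_ADDR", "-"),
        environ.get("HTTP_USER_AGENT", ""),
        environ.get("HTTP_REFERER", ""),
    )
    headers = [
        ("Content-Type", "image/gif"),
        ("Content-Length", str(len(PIXEL_GIF))),
        # Avoid cached copies hiding repeat visits.
        ("Cache-Control", "no-store"),
    ]
    start_response("200 OK", headers)
    return [PIXEL_GIF]
```

For a quick local try, the app can be served with the standard library's `wsgiref.simple_server.make_server`. In production the IP may need to come from a proxy header instead of `REMOTE_ADDR`, depending on how the site is fronted.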

While I understand that any well-coded scraper might skip both of the above, it would at least give us some insight into the very dirty ones.

Any other pointers as to the best way to go at this?

Robynroc answered 7/9, 2011 at 20:24 Comment(4)
What is your definition of "bad bots"? What kinds of things are you trying to prevent? A bot that behaves poorly in fetching your pages might not fall victim to an HTML-parsing trap like this, so you might not catch it. There might be easier ways to detect what you are looking for. - Grizzle
I don't understand how this is a honeypot implementation. Usually it involves a form field, hidden from the user via script/CSS, that bots unknowingly fill in. - Gooding
While it may sound overly broad, our definition of a bad bot / scraper is one that does not identify its source product (read: domain.com) via the user agent, OR whose domain.com does not provide a way to ban access via robots.txt. We see a lot of these. We already have a fairly comprehensive system that detects them via user agent (or lack thereof), lack of an Accept header, hits per interval, etc. So this would be a further addition to that system, giving us an extra signal on which IPs to focus manual effort on. - Robynroc
@Gooding We want to know if someone pulls a document body and then tries to pull all the images contained in the HTML. If the document body contains an image that is never visible to, or requested by, a real user, it should give us a little insight into bots / scrapers that are pulling document bodies plus all images in that HTML. - Robynroc

A bot that actually parses the HTML will ignore your img tag because it's within a comment.

Instead, you might consider creating an invisible div which contains a link to a trigger URL on the same site (preferably within the same directory, in case the bot is depth sensitive).

Cattery answered 7/9, 2011 at 20:42 Comment(1)
We ended up doing both: the hidden image and the hidden link. Thanks! - Robynroc

IMO any good scraper is going to know how to parse HTML using an SGML parser, and would just skip the commented image, but I could be wrong.
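As a rough illustration of that difference, the sketch below runs two toy "scrapers" over the same page: one uses Python's standard `html.parser`, the other a naive regex. The page content, the `/logo.gif` path, and the function names are made up for the example; real scrapers vary widely, so treat this only as a demonstration of why the commented image catches the crude ones and misses the careful ones.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one normal image plus the commented-out honeypot.
PAGE = """
<html><body>
<img src="/logo.gif">
<!-- <img src="http://www.domain.com/honeypot.gif"/> -->
</body></html>
"""


class ImgCollector(HTMLParser):
    """Collects src attributes of img tags the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def images_via_parser(html):
    # Comments go to handle_comment, so their contents are never tags.
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def images_via_regex(html):
    # A crude scraper: pattern-matches img tags without understanding comments.
    return re.findall(r'<img[^>]+src="([^"]+)"', html)


print(images_via_parser(PAGE))  # only the real image
print(images_via_regex(PAGE))   # the real image and the honeypot
```

So the commented image only flags the regex-style scrapers; a parser-based one never requests it.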

At most it will tell you when scraping happens, but it doesn't give you a way to counter a scraper. You would probably be better off coming up with some kind of cookie-based solution, as most bots probably don't bother with cookies. You could also randomize image paths between requests and expire them after a short period.
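One possible way to do the randomized, expiring paths, sketched under my own assumptions: embed a timestamp and a random nonce in the path and sign them with HMAC, so the server can validate a path without storing anything. The `/img/` prefix, the secret, the 5-minute lifetime, and the function names are all hypothetical.

```python
import hashlib
import hmac
import secrets
import time

SECRET = b"change-me"  # hypothetical server-side secret
TTL_SECONDS = 300      # hypothetical lifetime of a generated path


def _sign(issued_at, nonce):
    msg = f"{issued_at}:{nonce}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]


def make_image_path(now=None):
    """Build a fresh path of the form /img/<timestamp>-<nonce>-<signature>.gif."""
    issued_at = int(time.time() if now is None else now)
    nonce = secrets.token_hex(4)
    return f"/img/{issued_at}-{nonce}-{_sign(issued_at, nonce)}.gif"


def check_image_path(path, now=None):
    """Return 'ok', 'expired', or 'invalid' for a requested path."""
    now = int(time.time() if now is None else now)
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".gif"):
        return "invalid"
    parts = name[:-4].split("-")
    if len(parts) != 3:
        return "invalid"
    stamp, nonce, sig = parts
    if not (stamp.isascii() and stamp.isdigit()):
        return "invalid"
    expected = _sign(int(stamp), nonce)
    # Constant-time comparison of the supplied and expected signatures.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return "invalid"
    if now - int(stamp) > TTL_SECONDS:
        return "expired"
    return "ok"
```

A request that comes back "expired" or "invalid" suggests someone replaying URLs harvested from an old copy of the page, or guessing paths, which is a more useful signal than a plain hit.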

Checking the referrer is another obvious option, if you don't care about browsers that don't send one or people who hide or alter it.

Gooding answered 7/9, 2011 at 21:0 Comment(0)
