Prevent Custom Web Crawler from being blocked

I am creating a new web crawler in C# to crawl some specific websites. Everything works fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried putting delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp
  • using multiple IP addresses (even fake IP addresses) as the crawler's client IP

Any solution would help; a rough sketch of the kind of thing I mean is below.
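Here is a rough sketch of those two ideas with .NET's HttpClient; the URLs and delay values are just placeholders:

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading.Tasks;

    class CrawlerSketch
    {
        static async Task Main()
        {
            // Placeholder URLs; replace with the real pages to crawl.
            var urls = new List<string> { "http://example.com/page1", "http://example.com/page2" };
            var random = new Random();

            using var client = new HttpClient();

            // Present a Googlebot-style User-Agent. Note that many sites verify
            // this claim against Google's published IP ranges (see the answer
            // below about blocking fake Googlebots), so this alone is often not enough.
            client.DefaultRequestHeaders.TryAddWithoutValidation(
                "User-Agent",
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

            foreach (var url in urls)
            {
                string html = await client.GetStringAsync(url);
                Console.WriteLine($"{url}: {html.Length} characters");

                // Randomized pause so the request rate stays low.
                await Task.Delay(TimeSpan.FromSeconds(5 + random.Next(0, 10)));
            }
        }
    }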

Unsound answered 4/10, 2011 at 6:28 Comment(6)
I don't think you should do that... if the website doesn't want to be crawled, you shouldn't crawl it.Egger
If a site is rate limiting you, you better respect that. They may be under resource constraints or whatever. They might as well completely block you. Why not just slow down your bot when that happens?Fortier
Some of these websites block based on the average HTTP request rate over roughly a 12-hour window; they don't care about my momentary crawl rate. This is the web, and when you publish a website you should respect everyone who wants to see your pages. My question is how I can crawl these websites, even at the crawl rate they want, and I don't care whether this is legal or not!Unsound
@Egger - I have to disagree. My crawler has just as much right to visit a site as I myself do. If the site does not want its content crawled, then it should either protect it behind a login flow or not make it publicly accessible in the first place.Santos
Nice comment there Farzin: "And I don't care if this is legal or not!".Colley
He downvoted my answer below too since I put the cure for misbehaving evil ops on the same page :) I proudly take the reputation hits on this one.Colorfast

If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through that. Then your crawler will have a randomly changing IP address.

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
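A minimal sketch of the routing step in C#, assuming Privoxy is listening on its default address of 127.0.0.1:8118 and is configured to forward to Tor's SOCKS port:

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    class TorCrawlerSketch
    {
        static async Task Main()
        {
            var handler = new HttpClientHandler
            {
                // Privoxy's default listen address; Privoxy forwards on to Tor.
                Proxy = new WebProxy("http://127.0.0.1:8118"),
                UseProxy = true
            };

            using var client = new HttpClient(handler);

            // This page reports whether the request arrived via Tor; each new
            // Tor circuit can give the crawler a different exit IP address.
            string page = await client.GetStringAsync("https://check.torproject.org/");
            Console.WriteLine(page.Contains("Congratulations") ? "Routed through Tor" : "Not routed through Tor");
        }
    }

Expect throughput to drop noticeably, since every request travels over the Tor network.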

Santos answered 4/10, 2011 at 6:35 Comment(1)
Thanks, it helped. I used Tor, with Privoxy on top of it as a web proxy. Another important note is that I had to configure Tor to change its IP address every 5 minutes. Best regards.Unsound
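For reference, the torrc setting the comment above most likely refers to is MaxCircuitDirtiness; the 300-second value below is an assumption based on the 5-minute figure:

    # torrc: do not attach new requests to a circuit older than 5 minutes,
    # so the exit node (and the visible IP address) rotates roughly that often
    MaxCircuitDirtiness 300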

And this is how you block the fakers (just in case someone finds this page while searching for how to block them):

Block that trick in Apache:

# Block a fake Googlebot: the User-Agent claims to be Googlebot, but the
# request does not come from Google's 66.249.64.0 - 66.249.95.255 range
# (checked via X-Forwarded-For, i.e. assuming Apache sits behind a proxy).
# [F] => respond with 403 Forbidden
RewriteEngine On
RewriteCond %{HTTP:X-FORWARDED-FOR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC]
RewriteRule .* - [F,L]

Or a block in nginx, for completeness' sake:

   map_hash_bucket_size  1024;
   map_hash_max_size     102400;

   # $is_bot = 1 for any user agent that looks like a crawler
   map $http_user_agent $is_bot {
      default 0;
      ~(crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser) 1;
   }

   # $not_google = 0 only for client addresses inside Google's 66.0.0.0/8 range
   geo $not_google {
      default     1;
      66.0.0.0/8  0;
   }

   # $bots = 1 when the user agent claims to be Googlebot but the address is not Google's
   map $http_user_agent $bots {
      default              0;
      ~(?i)googlebot       $not_google;
   }

   # Reject flagged requests in a server or location block,
   # e.g. with: if ($bots) { return 403; }
Colorfast answered 8/1, 2013 at 11:31 Comment(0)
