How to tell if a web request is coming from Google's crawler?
From the HTTP server's perspective.

Tartu answered 22/7, 2010 at 12:6 Comment(2)
user-agents.org/index.shtml?g_m (Gauger)
possible duplicate of Verifying Googlebot in .htaccess file (Guildhall)

I captured a Google crawler request in my ASP.NET application; here is what its signature looks like:

Requesting IP: 66.249.71.113
Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

My logs show many different IPs for the Google crawler, all in the 66.249.71.* range and all geolocated to Mountain View, CA, USA.

A simple check is to verify that the request's user-agent string contains Googlebot and http://www.google.com/bot.html. Since many different IPs present the same client signature, I would not recommend matching on IP addresses; verify the client identity (the user-agent) instead.

Here's sample code in C#:

    string userAgent = (Request.UserAgent ?? "").ToLowerInvariant();

    if (userAgent.Contains("googlebot") ||
        userAgent.Contains("google.com/bot.html"))
    {
        // Yes, it's Googlebot.
    }
    else
    {
        // No, it's something else.
    }

It's important to note that any HTTP client can easily fake this user-agent string.

Lovel answered 22/7, 2010 at 12:8 Comment(1)
No, they're found to use a wide range of IPs, all in 66.249.71.* (Lovel)

You can read the official Verifying Googlebot page.

Quoting the page here:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

Shine answered 22/7, 2010 at 12:20 Comment(2)
Is there no way to query google.com or googlebot.com every so often via DNS to get the list of IPs or IP ranges? Doing this for every incoming request seems painful. Something like an MX record, but for A or AAAA records. (Paleolithic)
@Paleolithic I would implement this with caching: only look up the IP addresses you haven't resolved recently, and it works very well. (Precursory)

You can now perform an IP address check by matching against Googlebot's published IP address list at https://developers.google.com/search/apis/ipranges/googlebot.json

From the docs:

you can identify Googlebot by IP address by matching the crawler's IP address to the list of Googlebot IP addresses. For all other Google crawlers, match the crawler's IP address against the complete list of Google IP addresses.

Panjandrum answered 7/2, 2022 at 7:9 Comment(0)

If you're using the Apache web server, you could have a look at the access log file ('log\access.log').

Then load Google's IPs from http://www.iplists.com/nw/google.txt and check whether any of those IPs appear in your log.

Alex answered 22/7, 2010 at 12:15 Comment(1)
Nope, this is not a reliable way to do it, since crawler IPs can change. (Lovel)

Based on __curious_geek's solution, here's the JavaScript version:

if(window.navigator.userAgent.match(/googlebot|google\.com\/bot\.html/i)) {
  // Yes, it's google bot.
}
Husk answered 19/10, 2021 at 8:43 Comment(1)
Or it's someone pretending to be a Google bot. (Biddable)

To verify that a web request is coming from Google's crawler, you can check whether its IP address falls within the IP ranges published by Google, available here:

https://developers.google.com/search/apis/ipranges/googlebot.json

Alternatively, you can do a reverse DNS lookup and check whether the resulting hostname belongs to one of Google's domains, followed by a forward lookup to confirm it resolves back to the same IP.

Note: you can also check the User-Agent string, but because it can be spoofed, it's wiser to use one of the methods mentioned above instead.

You can use the npm package crawl-bot-verifier to verify Google, Bing, Baidu, and many other crawlers. The library performs a DNS lookup, which is reliable, and it has a nice API. You can find the package here:

https://www.npmjs.com/package/crawl-bot-verifier

Harden answered 25/7, 2024 at 1:41 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.