How to exclude all robots except Googlebot and Bingbot with both robots.txt and X-Robots-Tag
I have 2 questions regarding crawlers and robots.

Background info

I only want Google and Bing to be excluded from the “disallow” and “noindex” limitations. In other words, I want ALL search engines except Google and Bing to follow the “disallow” and “noindex” rules. In addition, I would also like a “nosnippet” function for the search engines I mentioned (which all support “nosnippet”). Which code do I use to do this (using both robots.txt and X-Robots-Tag)?

I want to have it both in the robots.txt file and in the .htaccess file as an X-Robots-Tag. I understand that robots.txt may be outdated, but I would like to give clear instructions to crawlers even if they're considered "ineffective" and "outdated", unless you think otherwise.

Question 1

Did I get the following code right to allow only Google and Bing to index (preventing other search engines from showing my pages in their results), and, furthermore, to prevent Bing and Google from showing snippets in their search results?

X-Robots-Tag code (Is this correct? I don't think I need to add "index" for googlebot and bingbot, since "index" is the default value, but I'm not sure.)

X-Robots-Tag: googlebot: nosnippet
X-Robots-Tag: bingbot: nosnippet
X-Robots-Tag: otherbot: noindex

robots.txt code (Is this correct? I think the 1st one is, but not sure.)

    User-agent: Googlebot
    Disallow:
    User-agent: Bingbot
    Disallow:
    User-agent: *
    Disallow: /

or

    User-agent: *
    Disallow: /
    User-agent: Googlebot
    Disallow:
    User-agent: Bingbot
    Disallow:

Question 2: Conflicts between robots.txt and X-Robots-Tag

I anticipate conflicts between robots.txt and the X-Robots-Tag, since the disallow and noindex rules can't work in conjunction (is there any advantage to using the X-Robots-Tag instead of robots.txt?). How do I get around this, and what is your recommendation?

End goal

As mentioned, the main goal is to explicitly tell all older robots (still using robots.txt) and all the newer ones except Google and Bing (using the X-Robots-Tag) not to show any of my pages in their search results (which I'm assuming is summed up in the noindex rule). I understand they may not all follow it, but I want them ALL, except Google and Bing, to be told not to show my pages in search results. To this end, I am looking for the right code for both robots.txt and the X-Robots-Tag that will work without conflict on the HTML sites I am trying to build.

Midsummer answered 8/5, 2019 at 22:7 Comment(2)
"I understand that robots.txt may be outdated": Where is this coming from?Otto
Hey, unor, I must be wrong about that. I guess robots.txt is still the main standard for instructing crawlers. I think I incorrectly assumed that everything is changing from robots.txt to X Robots Tag. Very new to this whole thing, and appreciate your efforts to get me on the right track. Thanks for that.Midsummer
robots.txt is not outdated. It’s still the only open/vendor-agnostic way to control what should not get crawled. X-Robots-Tag (and the corresponding meta-robots) is the only open/vendor-agnostic way to control what should not get indexed.

As you're aware, you can't disallow both for the same URL. There is no way around this. If a bot wants to crawl https://example.com/foo, it (hopefully) checks https://example.com/robots.txt to see if it's allowed to crawl it:

  • If crawling is allowed, the bot requests the document, and only then learns that it’s not allowed to index it. It has obviously already crawled the document, and it’s still allowed to crawl it.

  • If crawling is disallowed, the bot doesn’t request the document, and thus never learns that it’s also not allowed to index it, because it would need to crawl the document to see the HTTP header or the HTML element.

A Noindex field in robots.txt would solve this conflict, and Google seems to have supported it as an experimental feature, but you can't expect it to work.

So you have to choose: do you not want your pages to appear in other search engines' results (→ X-Robots-Tag), or do you not want other search engines' bots to crawl your documents (→ robots.txt)?

X-Robots-Tag

If you want to target all other bots (instead of listing each one, as your otherbot suggests, which would be virtually impossible), you should use

X-Robots-Tag: bingbot: nosnippet
X-Robots-Tag: googlebot: nosnippet
X-Robots-Tag: noindex

(I suppose Bingbot/Googlebot ignore the last line, as they already matched a previous line, but to be sure, you could add index to the lines of both bots.)
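For the .htaccess side of the question, with Apache these headers could be set roughly like this (a sketch assuming the mod_headers module is enabled; adjust to your server setup):

```apache
# Sketch: requires Apache's mod_headers module to be enabled
<IfModule mod_headers.c>
    Header set X-Robots-Tag "bingbot: nosnippet"
    Header add X-Robots-Tag "googlebot: nosnippet"
    Header add X-Robots-Tag "noindex"
</IfModule>
```

Header set replaces any existing X-Robots-Tag value, and the two Header add lines append the remaining values as separate headers.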

robots.txt

Records (each record starts with a User-agent line) need to be separated by empty lines:

User-agent: *
Disallow: /

User-agent: Bingbot
Disallow:

User-agent: Googlebot
Disallow:

The order of the records doesn’t matter, unless the bot "listens" to multiple names in your robots.txt (it will follow the first record that matches its name; and only if no name matches, it will follow the * record). So, after adding empty lines, both of your robots.txt files are fine.

You can also use one record for both bots:

User-agent: *
Disallow: /

User-agent: Bingbot
User-agent: Googlebot
Disallow:
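To double-check which bots this combined record lets through, the rules can be fed to Python's standard-library robots.txt parser (a quick sketch; the example URL and the SomeOtherBot name are made up):

```python
from urllib import robotparser

# The combined robots.txt from above: block everyone by default,
# then allow Bingbot and Googlebot via an empty Disallow.
RULES = """\
User-agent: *
Disallow: /

User-agent: Bingbot
User-agent: Googlebot
Disallow:
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://example.com/some-page.html"
for agent in ("Googlebot", "Bingbot", "SomeOtherBot"):
    # Expect: Googlebot True, Bingbot True, SomeOtherBot False
    print(agent, parser.can_fetch(agent, url))
```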
Otto answered 10/5, 2019 at 1:35 Comment(9)
Hey Unor, that makes sense now, why I can't disallow both at the same time. I also now understand that the "noindex" rule is an experimental feature of Google's, and I shouldn't expect it to work. Thanks for that detailed information, and I was glad to go through two other posts of yours related to this topic. – Midsummer
All I need is for all conforming bots (those that obey robots.txt) to first seek permissions in the robots.txt file, and if they find no restrictions (as with Google and Bing in the code you helped me with), to then continue to the URLs affected by the X-Robots-Tag code. So Bingbot and Googlebot (once they arrive at the URLs affected by the X-Robots-Tag) should then follow the "nosnippet" and, as you suggested, "index" rules. – Midsummer
It seems to me this should keep out all robots.txt-conforming crawlers except Googlebot and Bingbot (through robots.txt), and furthermore, through the X-Robots-Tag, let me further specify permissions for Googlebot and Bingbot (who were allowed in by robots.txt) with "nosnippet" and "index". Does this sound correct? Thanks for your very detailed answer! Really appreciate your help. Thanks So Much, Vince – Midsummer
@VinceJ: Yes, with both of the snippets from my answer, Googlebot/Bingbot are allowed to crawl everything and to index everything (but without a snippet). All other bots are not allowed to crawl anything, but they would be allowed to index. – Otto
I was reading one of your previous posts along the lines of the last part of your previous comment, which said, "a bot that isn't allowed to crawl a document may still index it (without ever accessing it)". Don't all respectable bots first check robots.txt before following and indexing from any external places, or do even good bots crawl pages without checking robots.txt? Let's assume in this scenario that sitemaps connect all pages of the site. – Midsummer
If all robots must check robots.txt first, then the only thing respectable/conforming robots could do would be to index the URL, but none of the URL's content. That would mean the search engines behind conforming robots could only show the URL in their results, with no snippet (since they are supposed to check robots.txt before crawling, and since they couldn't crawl the page, there is no snippet to show, just the URL). – Midsummer
So just to confirm: if I use the code you gave me, only nonconforming search engines would show snippets, i.e. snippets their nonconforming robots crawled without first checking the robots.txt file? Thanks, Unor! – Midsummer
@VinceJ: Most likely, but not necessarily. If a conforming bot (that is not Googlebot/Bingbot) indexes your page (without having crawled it), this bot might also show a snippet with content taken from a third party (which took the content from your page). But that's unlikely, because the bot couldn't verify that this is the actual content (as it's not allowed to crawl the page itself) or that it's relevant content. – Otto
Unor, thanks for thoroughly answering my question! Really appreciate you helping me off to a good start! Everything seems clear because of your help. :) Thanks again! – Midsummer
