Does the user agent string have to be exactly as it appears in my server logs?
When using a Robots.txt file, does the user agent string have to be exactly as it appears in my server logs?

For example, when trying to match GoogleBot, can I just use googlebot?

Also, will a partial-match work? For example just using Google?

Mediocrity answered 13/1, 2011 at 1:56 Comment(0)

Yes, the user agent has to be an exact match.

From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"

Chivaree answered 13/1, 2011 at 2:1 Comment(5)
Note that "Exact match" is not what the original robots.txt spec (on the same site) recommends.Harebell
"the user agent has to be an exact match." - You certainly shouldn't be using an "exact match" user-agent ("as it appears in [the] server logs") in the User-agent directive in robots.txt. eg. You never see User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (which actually fails to match the googlebot according to Google's robots.txt tester). As far as the Googlebot is concerned, a partial-match is required: User-agent: Googlebot.Defeatism
Hi @DocRoot, do you have a reference for what kind of matching is supported? This answer was from 2011, when the robots.txt spec specifically stated that globbing and regular expressions are not supported (although matches should be case-insensitive).Chivaree
@CameronSkinner See unor's answer (linked to above)Defeatism
"globbing and regular expressions are not supported", but that does not mean that it has to be an exact match. A case-insensitive substring match is intended for the User-Agent. That's not an exact match, and it isn't globbing or regular expression either.Saturniid

At least for Googlebot, the user-agent match is case-insensitive. Read the 'Order of precedence for user-agents' section:

https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html

Villein answered 13/1, 2011 at 2:2 Comment(0)

(As already answered in another question)

In the original robots.txt specification (from 1994), it says:

User-agent

[…]

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

[…]

But whether (and which) parsers actually work like that is another question. Your best bet is to look up the documentation of the bots you want to add. You’ll typically find the agent identifier string in it, e.g.:

  • Bing:

    We want webmasters to know that bingbot will still honor robots.txt directives written for msnbot, so no change is required to your robots.txt file(s).

  • DuckDuckGo:

    DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules […]

  • Google:

    The Google user-agent is (appropriately enough) Googlebot.

  • Internet Archive:

    User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags.
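The spec’s "case insensitive substring match of the name without version information" recommendation can be sketched as a tiny matcher. This is only an illustration of the 1994 recommendation, not any real parser; the function name and example strings are hypothetical:

```python
def group_applies(group_token: str, bot_name: str) -> bool:
    """Does a robots.txt User-agent token apply to a given crawler name?

    Implements the original spec's recommendation: a case-insensitive
    substring match of the name with version information stripped.
    """
    # Drop version info, e.g. "Googlebot/2.1" -> "Googlebot"
    name = bot_name.split("/")[0].lower()
    # "*" matches any robot; otherwise the token must appear in the name
    return group_token == "*" or group_token.lower() in name

print(group_applies("googlebot", "Googlebot/2.1"))  # True
print(group_applies("bingbot", "Googlebot/2.1"))    # False
print(group_applies("*", "archive.org_bot"))        # True
```

Note that under this rule `User-agent: Google` *would* match `Googlebot` (since "google" is a substring of "googlebot"), which is exactly the behavior real-world parsers such as Google's do not guarantee.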

Harebell answered 5/8, 2013 at 11:38 Comment(0)

robots.txt matching is case-sensitive for many bots. Google is more lenient than most and may accept its string either way; other bots may not.

Galsworthy answered 13/1, 2011 at 2:8 Comment(0)

Also, will a partial-match work? For example just using Google?

In theory, yes. However, in practice it seems that only specific partial matches, or "substrings" (as mentioned in @unor's answer), actually match. These specific substrings appear to be referred to as "tokens", and the match must often be exact for these tokens.

With regards to the standard Googlebot, only Googlebot (case-insensitive) appears to match. Any shorter partial match, such as Google, fails to match. Any longer partial match, such as Googlebot/1.2, fails to match. And using the full user-agent string (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) also fails to match. (There is technically more than one user-agent for the Googlebot anyway, so matching on the full user-agent string would not be recommended even if it did work.)

These tests were performed with Google's robots.txt tester.
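For comparison, Python's standard-library robots.txt parser happens to reproduce these same results for the strings tested above, although it implements the spec's case-insensitive substring check (with version information stripped at the first "/") rather than Google's exact-token matching, so the two can disagree on other inputs. The rules and example.com URL below are hypothetical:

```python
import urllib.robotparser

# A hypothetical robots.txt blocking /private/ for the "Googlebot" token only.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /private/",
])

url = "https://example.com/private/page"

print(rp.can_fetch("Googlebot", url))      # False: token matches, URL is blocked
print(rp.can_fetch("googlebot/2.1", url))  # False: case-insensitive, version stripped
print(rp.can_fetch("Google", url))         # True: shorter partial does not match the token
print(rp.can_fetch(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    url,
))                                         # True: the full log string does not match either
```

The last result mirrors the point above: the parser strips everything from the first "/" onward before matching, so the full user-agent string reduces to "Mozilla" and never matches the Googlebot group.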

Defeatism answered 15/6, 2018 at 16:4 Comment(0)
