Ignore URLs in robots.txt with specific parameters?
C

3

104

I would like Google to ignore URLs like this:

http://www.mydomain.example/new-printers?dir=asc&order=price&p=3

In other words, all URLs that have the parameters dir, order and p should be ignored. How do I do this with robots.txt?

Cantoris answered 5/2, 2012 at 13:55 Comment(0)
A
176

Here's a solution if you want to disallow all query strings:

Disallow: /*?*

Or, if you want to be more precise about the query string:

Disallow: /*?dir=*&order=*&p=*

You can also specify in robots.txt which URL to allow:

Allow: /new-printers$

The $ anchors the end of the URL, so only /new-printers itself (with no query string) will be allowed.
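
To see how these patterns behave, here is a minimal Python sketch of the wildcard matching. It assumes Google-style semantics, where * matches any run of characters and a trailing $ anchors the end of the URL. The helper names robots_pattern_to_regex and matches are hypothetical, and real crawlers may differ in details such as percent-encoding.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """Return True if the path (including its query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).search(path) is not None

url = "/new-printers?dir=asc&order=price&p=3"
print(matches("/*?*", url))                         # True
print(matches("/*?dir=*&order=*&p=*", url))         # True
print(matches("/new-printers$", url))               # False
print(matches("/new-printers$", "/new-printers"))   # True
```

Under these assumptions, the Disallow patterns catch the parameterised URL from the question, while the $-anchored Allow pattern matches only the bare /new-printers path.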

More info:

http://code.google.com/web/controlcrawlindex/docs/robots_txt.html

http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/

Audriaaudrie answered 5/2, 2012 at 14:17 Comment(9)
This will disallow new-printers; I only want to disallow the query string part. - Cantoris
So you want to allow /new-printers but not /new-printers?dir=*&order=*&p=*? - Audriaaudrie
Are those advanced wildcards and the Allow directive well supported? - Toddy
According to robotstxt.org/robotstxt.html, "there is no 'Allow' field". - Currey
Taking the new-printers example a bit further, what if different combinations and orders of parameters are possible on that file? Can you specify in a single rule that a specific file should be disallowed if any kind of parameters are added to it, without explicitly specifying them? Would Disallow: /new-printers?* work? - Bullish
@Bullish the last rule should work. It follows the same logic as the first rule. I never tried it, so I can't guarantee it will work. - Audriaaudrie
@JamieEdwards it's true that "Allow" is technically not part of the standard, but most of the popular search engines do support it. Allow lines should come before Disallow lines, though. - Fabrianne
@BookOfZeus Will the page be crawled or not if we add the said condition to robots.txt? - Bordelaise
There is now (as of 2019) a proposed standard undergoing ratification, and it does include Allow lines: datatracker.ietf.org/doc/html/draft-koster-rep. Perhaps surprisingly, it appears there was no formal "standard" before this, and search engines were left to operate "by convention" with a "de facto" standard, which led to spotty support for Allow lines except among the big ones (e.g. Google and Bing). - Stob
C
40

You can block those specific query string parameters with the following lines:

Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=

So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
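
Because these rules match substrings, it can help to check them against sample URLs before deploying. The following Python sketch does that under the assumption of Google-style wildcard matching (* matches any run of characters, and rules match from the start of the path). The helper names blocked and any_blocked and the sample URLs are hypothetical. The strict list is a tighter variant in which the parameter name must directly follow ? or &.

```python
import re

def blocked(pattern: str, path: str) -> bool:
    """Assumed matching: '*' is a wildcard, the rest is a literal prefix match."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(regex, path) is not None

def any_blocked(patterns, path: str) -> bool:
    """Return True if at least one Disallow pattern matches the path."""
    return any(blocked(p, path) for p in patterns)

# The three rules above: the parameter name may appear anywhere after '?'.
loose = ["/*?*dir=", "/*?*order=", "/*?*p="]
# Tighter variant: the name must come right after '?' or '&'.
strict = ["/*?dir=", "/*&dir=", "/*?order=", "/*&order=", "/*?p=", "/*&p="]

samples = [
    "/new-printers?dir=asc&order=price&p=3",
    "/new-printers?top=test",    # 'top=' ends in 'p='
    "/new-printers?reorder=1",   # 'reorder=' ends in 'order='
    "/new-printers",
]
for path in samples:
    print(path, any_blocked(loose, path), any_blocked(strict, path))
```

In this sketch the loose rules also block the top= and reorder= URLs, while the strict rules block only the URL that really carries dir, order or p.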

Catfall answered 4/5, 2015 at 17:51 Comment(6)
Does this mean that the whole page will not be crawled as long as the above condition is satisfied? - Bordelaise
Beware: this will also block parameters that partially match the expression, so not only example.com?p=test but also example.com?top=test. - Saucy
If you would like to ignore those parameters regardless of their position in the URL (first or later), you can try this: Disallow: /*?dir=* Disallow: /*?order=* Disallow: /*?p=* Disallow: /*&dir=* Disallow: /*&order=* Disallow: /*&p=* - Kaffir
Can the ? be ignored? - Traceable
If I use Disallow: /*?*order=, will it also disallow requests that contain reorder=? - Heraclea
Don't forget User-agent: * at the start of the file. - Setting
T
0

Register your website with Google WebMaster Tools. There you can tell Google how to deal with your parameters.

Site Configuration -> URL Parameters

You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, e.g.

<meta name="robots" content="noindex">

Toddy answered 5/2, 2012 at 15:3 Comment(3)
While the original question mentions Google specifically, it's important to note that Google WebMaster Tools would only affect Google. Adding the Disallow rules to the robots.txt file would address other search engines as well. - Fernand
True. It should also be clarified that robots.txt does not stop Google from indexing pages; it only stops it from reading their content. The best solution is using the robots meta tag on the page itself. This is supported by all systems. - Toddy
Note that this doesn't work anymore since they removed that functionality; see developers.google.com/search/blog/2022/03/… - Hatteras
