Ignore URLs in robots.txt with specific parameters?

Asked 5/2, 2012 at 13:55 Answered 4/5, 2015 at 17:51

104

I would like Google to ignore URLs like this:

http://www.mydomain.example/new-printers?dir=asc&order=price&p=3

In other words, all URLs that have the parameters dir, order and p should be ignored. How do I do that with robots.txt?

Cantoris answered 5/2, 2012 at 13:55 Comment(0)

176

Here's a solution if you want to disallow all query strings:

Disallow: /*?*

or, if you want to be more precise about the query string:

Disallow: /*?dir=*&order=*&p=*

You can also add a line to robots.txt saying which URL to allow:

Allow: /new-printer$

The $ anchors the end of the URL, so only /new-printer itself (with nothing after it) will be allowed.
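To see how these rules interact, here is a minimal Python sketch of Google-style pattern matching. It assumes `*` matches any run of characters, a trailing `$` anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie. The helper names `rule_to_regex` and `is_allowed` are made up for this illustration; this is not how any particular crawler is implemented.

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex (assumed semantics)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs.

    The longest matching pattern wins; on equal length Allow beats Disallow.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [("Disallow", "/*?*"), ("Allow", "/new-printer$")]
for path in ["/new-printers",
             "/new-printer",
             "/new-printers?dir=asc&order=price&p=3"]:
    print(path, is_allowed(rules, path))
```

Under these assumptions, /new-printers without a query string is not matched by Disallow: /*?* at all, so only the parameterised URL is blocked.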

More info:

http://code.google.com/web/controlcrawlindex/docs/robots_txt.html

http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/

Audriaaudrie answered 5/2, 2012 at 14:17 Comment(9)

This will disallow new-printers; I only want to disallow the query string part – Cantoris 5/2, 2012 at 15:2

so you want to allow /new-printer but not /new-printers?dir=*&order=*&p=*?? – Audriaaudrie 5/2, 2012 at 15:5

Are those advanced wildcards and the allow directive supported well? – Toddy 15/1, 2013 at 14:34

According to robotstxt.org/robotstxt.html - "there is no "Allow" field" – Currey 22/4, 2013 at 9:38

Taking the new-printers example a bit further, what if different combinations and orders of parameters are possible on that file. Can you specify in a single query that a specific file should be disallowed if any kind of parameters are added to it without explicitly specifying them? Would... Disallow: /new-printer?* work? – Bullish 27/8, 2014 at 21:20

@Bullish the last command should work. It will follow the same logic as the first condition. I never tried it so I can't guarantee it will work. – Audriaaudrie 27/8, 2014 at 22:25

@JamieEdwards it's true that "Allow" is technically speaking not part of the standard, but most of the popular search engines do support it. Allow lines should be before Disallow lines though. – Fabrianne 6/10, 2014 at 16:36

@BookOfZeus Will the page be crawled or not if we add the said condition to robots.txt? – Bordelaise 2/8, 2017 at 8:11

There is now (as of 2019) a proposed standard undergoing ratification, and it does include Allow lines - datatracker.ietf.org/doc/html/draft-koster-rep - perhaps surprisingly, it appears there was no formal "standard" previous to this, and search engines were left to their own devices to operate "by convention" with a "de facto" standard that led to spotty support for Allow lines except for the big ones (eg Google and Bing). – Stob 4/6, 2021 at 13:56

You can block those specific query string parameters with the following lines:

Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=

So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
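A rough way to check which URLs these three rules would catch is to translate each pattern into a regular expression. The sketch below assumes Google-style wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the end); `matches` and `is_blocked` are hypothetical helper names, not part of any library.

```python
import re

DISALLOW = ["/*?*dir=", "/*?*order=", "/*?*p="]

def matches(pattern, path):
    """Return True if the robots.txt pattern matches the start of path."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Literal pieces are escaped; each "*" becomes ".*".
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

def is_blocked(path):
    """A path is blocked if any Disallow pattern matches it."""
    return any(matches(rule, path) for rule in DISALLOW)

for path in ["/new-printers",
             "/new-printers?dir=asc&order=price&p=3",
             "/new-printers?color=red&p=3",
             "/new-printers?color=red"]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Because each pattern only requires the text after the `*` to appear somewhere after the `?`, the parameter can sit at any position in the query string.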

Catfall answered 4/5, 2015 at 17:51 Comment(6)

Does this mean that the whole page will not be crawled as long as the above condition is satisfied? – Bordelaise 2/8, 2017 at 8:10

Beware: this will also block parameters whose names only partially match the expression, so not only example.com?p=test but also example.com?top=test. – Saucy 1/12, 2019 at 16:32

If you would like to ignore those parameters regardless of their position in the URL (first or later), you can try:
Disallow: /*?dir=*
Disallow: /*?order=*
Disallow: /*?p=*
Disallow: /*&dir=*
Disallow: /*&order=*
Disallow: /*&p=*
– Kaffir 20/2, 2020 at 18:21

Can the ? be ignored? – Traceable 4/11, 2021 at 9:20

If I Disallow: /*?*order=, will it also disallow requests that contain reorder=? – Heraclea 21/1, 2022 at 10:27

Don't forget User-agent: * at the start of the file – Setting 18/2, 2022 at 3:33
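The partial-match concern raised in the comments above (top= and reorder=) can be checked with a small Python sketch. It compares the rules from the answer (called LOOSE here) with the six rules from Kaffir's comment (called STRICT). It assumes Google-style wildcard semantics, and the names `matches`, `blocked`, `LOOSE` and `STRICT` are invented for this illustration.

```python
import re

LOOSE = ["/*?*dir=", "/*?*order=", "/*?*p="]
STRICT = ["/*?dir=*", "/*?order=*", "/*?p=*",
          "/*&dir=*", "/*&order=*", "/*&p=*"]

def matches(pattern, path):
    """Assumed semantics: "*" is a wildcard, a trailing "$" anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

def blocked(rules, path):
    """True if any pattern in rules matches path."""
    return any(matches(rule, path) for rule in rules)

for path in ["/list?p=3", "/list?a=1&p=3",
             "/list?top=test", "/list?reorder=1"]:
    print(path, "loose:", blocked(LOOSE, path), "strict:", blocked(STRICT, path))
```

Under these assumptions the loose rules do catch ?top=test and ?reorder=1, while the strict rules require the parameter name to follow a ? or & directly and so leave those two URLs alone. The trailing * in the strict rules is redundant under this reading, since patterns already match as prefixes.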

In Google Webmaster Tools: Site Configuration -> URL Parameters

You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, e.g. <meta name="robots" content="noindex">

Toddy answered 5/2, 2012 at 15:3 Comment(3)

While the original question mentions Google specifically, it's important to note that a setting in Google Webmaster Tools would only affect Google. Adding the Disallow rules to the robots.txt file would address other search engines as well. – Fernand 14/1, 2013 at 20:37

True. It should also be clarified that robots.txt does not stop Google from indexing pages; it only stops Google from reading their content. The best solution is using the robots meta tag on the page itself. This is supported by all systems. – Toddy 15/1, 2013 at 14:35

Note that this doesn't work anymore since they removed that functionality, see developers.google.com/search/blog/2022/03/… – Hatteras 4/8, 2022 at 8:15
