How to prevent staging sites from being indexed by search engines

I would like my staging web sites to not be indexed by search engines (Google first and foremost).

I have heard WordPress is good at this, but I would like to stay technology agnostic.

Is robots.txt enough? We would like to keep anonymous access so the customer can see their website without having to log in.

Do I have to add nofollow to every page?

Squama answered 30/8, 2012 at 13:27 Comment(0)

I'm normally against exposing staging servers to the public web, but if that's the best solution for your workflow, here are a few things you can consider:

Minimal Approach

  • Create new domain for staging server (e.g. example-stage.com)
  • Add robots.txt => Disallow: /
  • Verify domain in Google & Bing Webmaster Tools

The minimal approach covers the very basics so you don't shoot yourself in the foot by having duplicate content everywhere. Registering a separate domain gives users a clean division between what is staging and what isn't. It is also a bit cleaner when you need to move environments around, but that's more operational. CNAMEs will work as well, but remember to register each CNAME with Google and Bing Webmaster Tools; this way you can use the domain removal tool if you need to.
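
For reference, the robots.txt served at the root of the staging domain would simply deny all crawlers (this is the standard deny-all form):

User-agent: *
Disallow: /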

Advised Approach

  • Add Authentication (HTTP or otherwise) in front of requests
  • Respond with an appropriate response code if not permitted (e.g. 401 Unauthorized)
  • Everything else in the Minimal Approach above

Adding a robots.txt prevents search engines from crawling and indexing the content. However, that doesn't mean they won't index the URL itself. If a search engine knows about a given URL, it may still add it to its index; you'll sometimes see these in the search results as a bare URL with no description. To prevent this from happening, the search engines need to be told not to show the content or the URLs. Putting authentication in front and not responding with a 200 OK status code is a strong signal to the engines not to add these URLs to their index. In my experience I have never seen a 401 response page listed in a search engine index.
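
As one illustration of the authentication step, a minimal Apache HTTP Basic Auth sketch (the realm name and AuthUserFile path are placeholders) will answer 401 Unauthorized to anyone without credentials:

# Require a login for the whole staging site;
# unauthenticated requests receive a 401 Unauthorized response.
AuthType Basic
AuthName "Staging"
AuthUserFile /etc/apache2/.htpasswd-staging
Require valid-user

The password file itself can be created with htpasswd -c /etc/apache2/.htpasswd-staging someuser.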

Preferred Approach

  • Put staging sites behind IP tables (e.g. accessible only from a given IP range)
  • Add meta or x-robots commands to each page with a value of NOINDEX, NOFOLLOW
  • Everything else in the Advised Approach

Putting the staging sites behind an IP filter ensures that only your clients are able to access the site. This can be a problem if they want to access it from other computers, and sometimes a maintenance headache, but it's the best approach if you don't want your staging environment indexed. A word of caution: make sure that all other requests (e.g. from search engines and non-clients) don't get anything served back. They should receive a timeout and never a 200 OK. Serving other information could be mistaken for cloaking, which you don't want.
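
As a sketch of the IP filtering with iptables (the client range 203.0.113.0/24 is a placeholder), you can accept web traffic only from your clients and silently drop everything else, which produces the timeout mentioned above rather than a response:

# Allow the client's range on HTTP/HTTPS, drop everyone else
iptables -A INPUT -p tcp -m multiport --dports 80,443 -s 203.0.113.0/24 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j DROP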

Additionally, to be extra safe, I would also add a meta robots or X-Robots-Tag header to each page with NOINDEX, NOFOLLOW, just in case the IP filter fails from a misconfiguration or authentication ever fails ... it's rare, but it happens when people touch the configurations for other reasons. Like the robots.txt file, you can really shoot yourself in the foot with these page-level robots commands if they ever get pushed out to production. So make sure your dev / staging environments are in a cleanly separated configuration; otherwise pushing out a NOINDEX, NOFOLLOW or a Disallow: / would be disastrous for your production site.
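
If you do add those page-level directives, one way to keep them from leaking into production is to scope them to the staging hostname in the web server configuration. A rough Apache 2.4 sketch (hostname is a placeholder, mod_headers required):

# Only send the noindex header when the request is for the staging host
<If "%{HTTP_HOST} == 'staging.example.com'">
    Header set X-Robots-Tag "noindex, nofollow"
</If>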

Mychael answered 31/8, 2012 at 17:27 Comment(3)
"Add meta or x-robots commands to each page with a value of NOINDEX, NOFOLLOW" seems the good point with "Add robots.txt => Disallow: /". The rest of your answer is a too much restricted area for me: " We would like to keep anonymous access" . I will try to see what happens. Thanks for your answer.Squama
If it's all the same code base, wouldn't modifying the robots.txt file cause any server to be ignored, not just staging? – Deenadeenya
@AndrewMortimer ... The assumption is that you have config files that define different settings / robots.txt files for development, staging, and production. So the config file will be read by the server / environment ... if the environment is the staging environment, it'll use the staging configuration. – Mychael

You can disable indexing server-wide by adding the setting below globally in the Apache conf, or the same directive can be used inside a vhost to disable indexing for that particular vhost only.

Header set X-Robots-Tag "noindex, nofollow"

Once this is done, you can test it by checking the headers Apache returns:

curl -I staging.mywebsite.com

HTTP/1.1 302 Found
Date: Sat, 26 Nov 2016 22:36:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: /pages/
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
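
For the per-vhost variant, a rough sketch (DocumentRoot path is a placeholder; mod_headers must be enabled, e.g. with a2enmod headers):

<VirtualHost *:80>
    ServerName staging.mywebsite.com
    DocumentRoot /var/www/staging

    # Send the noindex/nofollow header on every response from this vhost
    Header set X-Robots-Tag "noindex, nofollow"
</VirtualHost>
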
Schrader answered 26/11, 2016 at 22:49 Comment(0)

I added this code to my site (coded in php):

if( $_SERVER['HTTP_HOST'] == 'test.ate.io' ) {
    header("X-Robots-Tag: noindex, nofollow", true);    
}

That way, even if my config file from staging accidentally gets pushed to my production server there won't be any problems.

Volding answered 3/8, 2013 at 18:29 Comment(0)

TLDR; Create a robots.txt file in your root web directory. This file should contain:

User-agent: *
Disallow: /

This tells Google and Bing bots not to crawl your website, although (as noted in the answer above) URLs that are already known to them may still appear in search results without a description.

Hemicellulose answered 27/10, 2015 at 3:51 Comment(0)

Add the following meta tag into the <head> section of your page:

<meta name="robots" content="noindex">

To prevent only Google from indexing a page:

<meta name="googlebot" content="noindex">
Thyme answered 13/10, 2020 at 8:35 Comment(0)
