web, search-engine, robots.txt, nofollow

How to prevent staging sites from being indexed by search engines


I would like to keep my staging websites from being indexed by search engines (Google first of all).

I have heard WordPress is good at doing this, but I would like to be technology agnostic.

Is robots.txt enough? We would like to keep anonymous access so that the customer can see their website without having to log in.

Do I have to add nofollow to every page?


Solution

  • I'm normally against exposing staging servers to the public web, but if that's the best solution for your workflow, here are a few things you can consider:

    Minimal Approach

    The minimal approach covers the very basics, so that you don't shoot yourself in the foot by having duplicate content everywhere. Registering a separate domain gives users a clean division between what is staging and what isn't. It is also a bit cleaner when you need to move environments around, but that's more of an operational concern. CNAMEs will work as well, but remember to register each CNAME with Google and Bing Webmaster Tools. That way you can use the domain removal tool if you need to.

    Advised Approach

    Adding a robots.txt prevents search engines from accessing and indexing the content. However, that doesn't mean they won't index the URL. If a search engine knows about a given URL, it may still add it to the search result index. You'll sometimes see these in the search results: the title tends to be the URL, with no description. To prevent this from happening, the search engines need to be told not to show the content or the URLs. Putting authentication in front of the site, so that it does not respond with a 200 OK status code, is a strong signal to the engines not to add these URLs to their index. In my experience, I have never seen a page with a 401 response code listed in a search engine index.
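
    As a rough illustration of the two signals described above, here is a minimal Python sketch using only the standard library. It shows how a compliant crawler would read a blanket Disallow rule, and how a request without valid HTTP Basic credentials could be answered with 401 instead of 200. The credentials, the example URL and the function names are assumptions made up for this example, not part of any particular server.

```python
import base64
import hmac
from typing import Dict, Optional, Tuple
from urllib.robotparser import RobotFileParser

# robots.txt that asks every compliant crawler to stay out of the whole site.
STAGING_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# Hypothetical credentials, for illustration only.
USERNAME = "client"
PASSWORD = "change-me"


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def status_for(authorization: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Return (status code, extra headers) for a request's Authorization header."""
    token = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
    expected = "Basic " + token
    # compare_digest avoids leaking how much of the header matched.
    if authorization is not None and hmac.compare_digest(
        authorization.encode(), expected.encode()
    ):
        return 200, {}
    # No or wrong credentials: never answer 200, ask for authentication instead.
    return 401, {"WWW-Authenticate": 'Basic realm="staging"'}
```

    Note that the robots.txt rule only stops crawling; it is the 401 response that keeps an already known URL from looking like an indexable page.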

    Preferred Approach

    Putting the staging sites behind an IP filter ensures that only your clients are able to access the site. This can be a problem if they want to access it from other computers, and it is sometimes a maintenance headache, but it's the best approach if you don't want your staging environment indexed. A word of caution: make sure that all other requests (e.g. from search engines and non-clients) get nothing served back. They should receive a timeout and never a 200 OK. Serving them different content could be mistaken for cloaking, which you don't want.
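
    The allowlist decision itself can be sketched in a few lines of Python with the standard ipaddress module. The networks below are documentation-only example ranges standing in for your clients' addresses, and is_allowed is a hypothetical helper name. In practice the filtering is usually done at the firewall or web server level, which is also where non-listed requests can be silently dropped so that they time out; this sketch only shows the matching logic.

```python
import ipaddress

# Hypothetical client networks; replace with the real office or VPN ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(remote_addr: str) -> bool:
    """Return True only if remote_addr belongs to an allowed network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```

    For example, is_allowed("203.0.113.42") is true with the ranges above, while an address outside them, or a malformed value, is rejected.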

    Additionally, to be extra safe, I would also add a meta robots tag or an X-Robots-Tag header to each page with NOINDEX, NOFOLLOW, just in case the IP filter fails because of a misconfiguration or authentication ever fails. It's rare, but it happens when people touch the configuration for other reasons. As with the robots.txt file, you can really shoot yourself in the foot with these page-level robots directives if they ever get pushed out to production, so make sure your dev / staging environments are in a cleanly separated configuration. Otherwise, pushing out a NOINDEX, NOFOLLOW or a Disallow: / would be disastrous for your production site.
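
    One way to keep these directives tied to the environment is sketched below in Python. It assumes the application knows which environment it runs in; the environment names and both function names are hypothetical. The first function returns the X-Robots-Tag response header (the standard header name for robots directives) only for non-production environments, and the second is a simple pre-release check that flags the two mistakes mentioned above.

```python
from typing import Dict, List

NOINDEX_VALUE = "noindex, nofollow"
# Hypothetical environment names; adapt to your own deployment setup.
NON_PRODUCTION_ENVIRONMENTS = {"dev", "staging"}


def robots_headers(environment: str) -> Dict[str, str]:
    """Return the extra response headers for the given environment."""
    if environment in NON_PRODUCTION_ENVIRONMENTS:
        return {"X-Robots-Tag": NOINDEX_VALUE}
    return {}


def production_blockers(headers: Dict[str, str], robots_txt: str) -> List[str]:
    """Return reasons why a release would be hidden from search engines."""
    problems = []
    # Header lookup is case-sensitive here; pass the name exactly as above.
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        problems.append("X-Robots-Tag contains noindex")
    # Strip comments and whitespace, then look for a blanket "Disallow: /".
    rules = [line.split("#", 1)[0].strip().lower() for line in robots_txt.splitlines()]
    if any(rule.replace(" ", "") == "disallow:/" for rule in rules):
        problems.append("robots.txt contains Disallow: /")
    return problems
```

    Running production_blockers against the headers and robots.txt of a release candidate, and refusing to deploy when the list is not empty, is one possible safeguard against staging settings leaking into production.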