robots.txt

Robots.txt - prevent index of .html files


I want to prevent index of *.html files on our site - so that just clean urls are indexed.

So I would like www.example.com/en/login indexed but not www.example.com/en/login/index.html

Currently I have:

User-agent: *
Disallow: /
Disallow: /**.html   - not working
Allow: /$
Allow: /*/login*

I know I can just disallow e.g. Disallow: /*/login/index.html, but my issue is I have a number of these .html files that I do not want indexed - so wondered if there was a way to Disallow them all instead of doing them individually?


Solution

  • First of all, you keep using the word "indexed", so I want to ensure that you're aware that the robots.txt convention is only about suggesting to automated crawlers that they avoid certain URLs on your domain, but pages listed in a robots.txt file can still show up on search engine indexes if they have other data about the page. For instance, Google explicitly states they will still index and list a URL, even if they're not allowed to crawl it. I just wanted you to be aware of that in case you are using the word "indexed" to mean "listed in a search engine" rather than "getting crawled by an automated program".

    Secondly, there's no standard way to accomplish what you're asking for. Per "The Web Robots Pages":

    Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".

    That being said, it's a common addition that many crawlers do support. For example, in Google's documentation of they directives they support, they describe pattern matching support that does handle using * as a wildcard. So, you could add a Disallow: /*.html$ directive and then Google would not crawl URLs ending with .html, though they could still end up in search results.

    But, if your primary goal is telling search engines what URL you consider "clean" and preferred, then what you're actually looking for is specifying Canonical URLs. You can put a link rel="canonical" element on each page with your preferred URL for that page, and search engines that use that element will use it in order to determine which path to prefer when displaying that page.