seo, robots.txt

Can I use the “Host” directive in robots.txt?


Searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.example.com

The Wikipedia article also states that Google understands the Host directive, but there was hardly any (i.e. no) further information.

At robotstxt.org, I didn't find anything about Host (or about Crawl-delay, which Wikipedia also mentions).

  1. Is it encouraged to use the Host directive at all?
  2. Are there any resources from Google on this robots.txt extension?
  3. How compatible is it with other crawlers?

At least since the beginning of 2021, the linked entry no longer covers the directive in question.


Solution

  • The original robots.txt specification says:

    Unrecognised headers are ignored.

    They call them "headers", but this term is not defined anywhere. However, as it is mentioned in the section about the format, in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".

    So yes, you can use Host or any other field name.

    But keep in mind: as Host is not part of the robots.txt specification, you can't be sure that different parsers support this field in the same way, or at all. You would have to check each parser that claims to support it manually (see the sketch below for how a lenient parser might handle it).
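
    To make that concrete, here is a minimal, hypothetical sketch in Python (not the code of any real crawler or library) of a lenient parser that collects every field it finds, so a field like Host can be read by consumers that choose to honour it and simply ignored by everyone else. The function names parse_robots and preferred_host are made up for illustration.

    def parse_robots(text):
        """Return a list of (field, value) pairs, skipping comments and blank lines."""
        records = []
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()    # strip trailing comments
            if not line or ":" not in line:
                continue                            # ignore blank or malformed lines
            field, _, value = line.partition(":")
            records.append((field.strip().lower(), value.strip()))
        return records

    def preferred_host(records):
        """Return the value of the first Host field, or None if the consumer
        treats Host as an unrecognised (and therefore ignored) field."""
        for field, value in records:
            if field == "host":
                return value
        return None

    if __name__ == "__main__":
        sample = "User-Agent: *\nDisallow: /dir/\nHost: www.example.com\n"
        records = parse_robots(sample)
        print(records)                  # [('user-agent', '*'), ('disallow', '/dir/'), ('host', 'www.example.com')]
        print(preferred_host(records))  # www.example.com

    The point of the sketch is that "unrecognised headers are ignored" costs the parser nothing: unknown fields are still syntactically valid lines, so keeping or discarding them is purely the consumer's decision.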