While searching for specific information on robots.txt, I stumbled upon a Yandex help page‡ on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:
User-Agent: *
Disallow: /dir/
Host: www.example.com
Also, the Wikipedia article states that Google understands the Host directive too, but there wasn’t much (i.e. no) further information.
At robotstxt.org, I didn’t find anything on Host (or Crawl-delay, as stated on Wikipedia).
Do search engines other than Yandex support the Host directive at all? Or is it specific to Yandex’s robots.txt handling?

‡ At least since the beginning of 2021, the linked entry no longer deals with the directive in question.
The original robots.txt specification says:

"Unrecognised headers are ignored."
They call it "headers", but this term is not defined anywhere. Still, as it’s mentioned in the section about the format, in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".
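To make that reading concrete, here is a minimal sketch in Python (my own illustration, not code from any robots.txt library) of a line parser that treats each line as a field: value pair and skips field names it doesn’t recognise, which is exactly what "Unrecognised headers are ignored" prescribes:

# Minimal robots.txt line parser (illustrative sketch only).
KNOWN_FIELDS = {"user-agent", "disallow"}  # the two fields from the original spec

def parse_robots_txt(text):
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue  # blank or malformed lines carry no record
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field in KNOWN_FIELDS:
            records.append((field, value.strip()))
        # Anything else (Host, Crawl-delay, ...) is ignored, not an error.
    return records

robots = """User-Agent: *
Disallow: /dir/
Host: www.example.com
"""

print(parse_robots_txt(robots))
# -> [('user-agent', '*'), ('disallow', '/dir/')]

Run on the Yandex example from the question, only the User-agent and Disallow records survive; the Host line is silently dropped.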
So yes, you can use Host or any other field name.
But keep in mind: as such fields are not specified by the robots.txt project, you can’t be sure that different parsers support them in the same way. So you’d have to check each parser that claims to support the field manually.
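For example, Python’s standard urllib.robotparser behaves exactly as the spec allows: it understands User-agent, Disallow, and Allow (plus a few extensions such as Crawl-delay in recent versions) and silently skips a Host line, so if you want the Host value you have to extract it yourself. A quick check:

from urllib.robotparser import RobotFileParser

robots = """User-Agent: *
Disallow: /dir/
Host: www.example.com
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

# The Disallow rule is applied; the unknown Host line was skipped silently.
print(rp.can_fetch("MyBot", "https://example.com/dir/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/other"))     # True

# RobotFileParser exposes no Host value, so pull it out by hand if needed
# (this one-off extraction is my own sketch, not part of the stdlib API).
host = next(
    (line.split(":", 1)[1].strip()
     for line in robots.splitlines()
     if line.lower().startswith("host:")),
    None,
)
print(host)  # www.example.com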