ReadTheDocs robots.txt and sitemap.xml

Tags: robots.txt, read-the-docs, google-search-console, xml-sitemap


ReadTheDocs auto-generates a robots.txt and sitemap.xml for projects. Each time I deploy a new minor version of my project (ex. 4.1.10), I hide previous minor versions (ex. 4.1.9). ReadTheDocs adds entries for every version to sitemap.xml, but hidden versions are also added to robots.txt as Disallow rules. The result is that the sitemap previously submitted to Google Search Console now produces "Submitted URL blocked by robots.txt" errors, since sitemap entries for older versions are blocked by the newly generated robots.txt.

ReadTheDocs generates a sitemap URL for each version, so we have an entry like this for 4.1.9, for example:

<url>
   <loc>https://pyngrok.readthedocs.io/en/4.1.9/</loc>
   <lastmod>2020-08-12T18:57:47.140663+00:00</lastmod>
   <changefreq>monthly</changefreq>
   <priority>0.7</priority>
</url>

And when 4.1.10 is released and the previous minor version is hidden, the newly generated robots.txt gets:

Disallow: /en/4.1.9/ # Hidden version

I believe this Disallow is what then causes the Google crawler to throw the error.
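
For context, the full auto-generated robots.txt ends up looking roughly like the sketch below (one Disallow per hidden version; the exact contents depend on which versions are hidden):

User-agent: *
Disallow: /en/4.1.9/ # Hidden version
# ...plus a similar Disallow line for each other hidden version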

Realistically, all I want in the sitemap.xml are latest, develop, and stable; I don't much care to have every version crawled. But as I understand it from the ReadTheDocs docs, all I'm able to configure is a static robots.txt.

What I want is to publish a static sitemap.xml of my own instead of using the auto-generated one. Any way to accomplish this?


Solution

  • After playing around with a few ideas, here is the solution I came up with. Since this question is asked frequently and often opened as a bug against ReadTheDocs on GitHub (it's not a bug, it just appears to be poorly supported and/or documented), I'll share my workaround here for others to find.

    As mentioned above and in the docs, ReadTheDocs lets you override the auto-generated robots.txt and publish your own, but you can't do the same with sitemap.xml. It's unclear why. Regardless, you can simply publish a different sitemap (I named mine sitemap-index.xml) and then tell your robots.txt to point to your custom sitemap.

    For my custom sitemap-index.xml, I only put in the pages I care about rather than every generated version (since stable and latest are really what I want search engines crawling, not versioned pages):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:xhtml="http://www.w3.org/1999/xhtml">
        <url>
            <loc>https://pyngrok.readthedocs.io/en/stable/</loc>
            <changefreq>weekly</changefreq>
            <priority>1</priority>
        </url>
        <url>
            <loc>https://pyngrok.readthedocs.io/en/latest/</loc>
            <changefreq>daily</changefreq>
            <priority>0.9</priority>
        </url>
        <url>
            <loc>https://pyngrok.readthedocs.io/en/develop/</loc>
            <changefreq>monthly</changefreq>
            <priority>0.1</priority>
        </url>
    </urlset>
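
    If you want to sanity-check the handwritten file before publishing it, a quick parse with Python's standard library catches malformed XML and typos in the tags (a minimal sketch; the path assumes the file lives at docs/_html/sitemap-index.xml, as described below):

    # Parse the handwritten sitemap and list each URL with its priority.
    # The path assumes the file is kept at docs/_html/sitemap-index.xml (see below).
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    root = ET.parse("docs/_html/sitemap-index.xml").getroot()
    for url in root.findall("sm:url", NS):
        loc = url.find("sm:loc", NS).text
        priority = url.find("sm:priority", NS).text
        print(f"{loc} (priority {priority})")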
    

    I then created my own robots.txt that disallows everything except my main branches and points to my custom sitemap-index.xml. (Google applies the most specific matching rule, so the longer Allow entries take precedence over the broad Disallow: /.)

    User-agent: *
    Disallow: /
    Allow: /en/stable
    Allow: /en/latest
    Allow: /en/develop
    Sitemap: https://pyngrok.readthedocs.io/en/latest/sitemap-index.xml
    

    I put these two files under /docs/_html, and to my Sphinx conf.py file (which is in /docs) I added:

    html_extra_path = ["_html"]
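
    So the relevant layout is:

    docs/
    ├── conf.py               # contains html_extra_path = ["_html"]
    └── _html/
        ├── robots.txt
        └── sitemap-index.xml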
    

    This is also shown in the repo, for reference.

    After ReadTheDocs rebuilds the necessary branches, submit /en/latest/sitemap-index.xml to Google Search Console instead of the default sitemap and ask Google to reprocess your robots.txt. Not only will the crawl errors be resolved, Google will also properly index a site that hides previous minor versions.
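
    As a final check once the rebuild is live, you can fetch both files and confirm the override took effect (a minimal sketch; it assumes robots.txt is served at the domain root and the sitemap under /en/latest/, matching the URLs above):

    # Fetch the served robots.txt and custom sitemap to confirm the override is live.
    # URLs are the pyngrok ones used throughout; swap in your own project's domain.
    from urllib.request import urlopen

    for url in (
        "https://pyngrok.readthedocs.io/robots.txt",
        "https://pyngrok.readthedocs.io/en/latest/sitemap-index.xml",
    ):
        with urlopen(url) as resp:
            print(f"--- {url} ({resp.status})")
            print(resp.read().decode("utf-8"))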