php .htaccess codeigniter codeigniter-3 xml-sitemap

Create a sitemap.xml programmatically for a multi-language / multi-domain site on the fly


Note: this question is not about how sitemaps work or what a sitemap structure looks like, nor is it an SEO issue.

My domains mysite.com and mysite.pt are 2 language versions (EN, PT) of the same site. The content is added/removed dynamically via a db-driven CMS.

Each menu/category update creates its language-specific routes, e.g. mysite.com/beach and mysite.pt/praia, and both routes point to the same controller method, e.g. site_manager/page/beaches.

The codebase for each language version is identical, so there is only one /application, /assets and /system folder for all language versions. The language-specific content is loaded via <?=$this->lang->line('my_token1')?>

The file system looks like:

/public_html
    /mysite.com/index.php
    /mysite.pt/index.php
    /all_sites/application
    /all_sites/assets
    /all_sites/system

The index.php file in each site’s root directory changes the system and application folder locations:

$system_path = '/home/my_host/public_html/all_sites/system';
$application_folder = '/home/my_host/public_html/all_sites/application';

This setup works smoothly. But the CMS is old and doesn't create an updated sitemap whenever menu or content is changed.

So I thought about another way to provide an up-to-date, site- and language-specific sitemap.xml: when a bot comes to scan the site, the sitemap could be created on the fly, so the bot always receives the latest version.

I resolved this by creating a controller method site_manager/sitemap() which parses the database entries and outputs a sitemap with echo $this->load->view('sitemap',$data,true);

which outputs, depending on the site:

<!-- created by mysite.pt, 2020-12-22 -->
<url>
  <loc>https://mysite.pt/</loc>
  <lastmod>2020-12-22T20:53:36+00:00</lastmod>
  <priority>1.00</priority>
</url>
<url>
  <loc>https://mysite.pt/praias.html</loc>
  <lastmod>2020-12-22T19:51:51+00:00</lastmod>
  <priority>0.80</priority>
</url>

Or

<!-- created by mysite.com, 2020-12-22 -->
<url>
  <loc>https://mysite.com/</loc>
  <lastmod>2020-12-22T20:53:36+00:00</lastmod>
  <priority>1.00</priority>
</url>
<url>
  <loc>https://mysite.com/beaches.html</loc>
  <lastmod>2020-12-22T19:51:51+00:00</lastmod>
  <priority>0.80</priority>
</url>
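
For illustration, output like the above could come from a small helper along these lines. This is only a sketch, not my actual controller code: the function name build_sitemap_xml and the row keys path, lastmod (a Unix timestamp) and priority are assumed names, and the rows would really come from the CMS database. It also wraps the entries in the <urlset> root element that a complete sitemap needs.

```php
<?php
// Hypothetical helper (not the original controller): builds sitemap XML for one host.
// The row keys 'path', 'lastmod' and 'priority' are assumed names for illustration.
function build_sitemap_xml(string $host, array $pages): string
{
    $lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ];
    foreach ($pages as $page) {
        $url = 'https://' . $host . '/' . ltrim($page['path'], '/');
        // Escape the URL so characters such as & remain valid inside XML.
        $loc = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $lines[] = '<url>';
        $lines[] = '  <loc>' . $loc . '</loc>';
        // gmdate('c') renders the timestamp as ISO 8601 in UTC (+00:00).
        $lines[] = '  <lastmod>' . gmdate('c', $page['lastmod']) . '</lastmod>';
        $lines[] = '  <priority>' . number_format($page['priority'], 2, '.', '') . '</priority>';
        $lines[] = '</url>';
    }
    $lines[] = '</urlset>';
    return implode("\n", $lines) . "\n";
}

// Example rows standing in for the database result.
$pages = [
    ['path' => '', 'lastmod' => strtotime('2020-12-22T20:53:36+00:00'), 'priority' => 1.0],
    ['path' => 'praias.html', 'lastmod' => strtotime('2020-12-22T19:51:51+00:00'), 'priority' => 0.8],
];
echo build_sitemap_xml('mysite.pt', $pages);
```

Inside the controller, the host argument would presumably be taken from the current request, so the same code serves both mysite.com and mysite.pt.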

The question: The problem with this setup is that there is no sitemap.xml file in the root directory; there is only the echoed output, produced when the controller method is called. The bots will go home empty-handed because they cannot find a sitemap.xml, and the same happens if you type mysite.com/sitemap.xml in a browser.

How can I make the bot access the controller method and read the generated output?


Solution

  • Bots are looking for the sitemap.xml in the site’s root.

    To make a bot “read” the controller’s echoed output, it needs to be directed to the controller function, in our case to site_manager/sitemap().

    The trick is to add a .htaccess rewrite rule that sends the request to the controller which creates the sitemap output. Note that site_manager is set as the default controller in routes.php:

    # rewrite requests for sitemap.xml to sitemap.php
    RewriteRule ^sitemap\.xml$ sitemap.php [L]
    

    This means that a bot requesting the “non-existent” sitemap.xml is handed over to the controller and served the output of echo $this->load->view('sitemap',$data,true);, i.e. the sitemap is generated on the fly from the most up-to-date data.

    You can test that the sitemap is generated by typing e.g. https://mysite.pt/sitemap.xml in your browser.

    Note: you won’t find this sitemap.xml file in the FTP directory listing of ftp://mysite.pt, since the file was never written or uploaded.

    You can also verify this through the search consoles of the major search engines such as Google and Bing, and confirm that the sitemap was picked up successfully by the visiting bot.
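
If you prefer to check the result from a script rather than the browser, something like the following could work. It is only a rough sketch: sitemap_locations is a made-up helper name, and the commented usage assumes allow_url_fopen is enabled and the site is reachable.

```php
<?php
// Hypothetical check (not part of the setup above): returns every <loc> value
// found in a sitemap document, or throws if the XML is not well-formed.
function sitemap_locations(string $xml): array
{
    // Collect parser errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    if ($doc->loadXML($xml) === false) {
        libxml_clear_errors();
        throw new RuntimeException('Response is not well-formed XML');
    }
    $locations = [];
    foreach ($doc->getElementsByTagName('loc') as $node) {
        $locations[] = trim($node->textContent);
    }
    return $locations;
}

// Usage (needs network access):
// $xml = file_get_contents('https://mysite.pt/sitemap.xml');
// print_r(sitemap_locations($xml));
```

An exception here would suggest the rewrite is returning something other than the sitemap, for example an HTML error page.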