
Canonical Header Links for PDF and Image files in .htaccess


I'm attempting to set up canonical links for a number of PDF and image files on my website.

Example Folder Structure:

/index.php
/docs/
    file.pdf
    /folder1/
        file.pdf
    /folder2/
        file1.pdf
        file2.pdf
/img/
    sprite.png
    /slideshow/
        slide1.jpg
        slide2.jpg

Example PDF URL to Canonical URL: http://www.example.com/docs/folder1/file.pdf --> http://www.example.com/products/folder1/

I am trying to avoid having to put individual .htaccess files in each of the sub-folders that contain my images and PDFs. I currently have 7 "main" folders, each of which has anywhere from 2 to 10 sub-folders, and most sub-folders have their own sub-folders. I have roughly 80 PDFs, and even more images.

I'm looking for a (semi)dynamic solution where all files in a certain folder will have the canonical link set to a single URL. I want to keep as much as possible in a single .htaccess file.

I know that <Files> and <FilesMatch> do not understand paths, and that <Directory> and <DirectoryMatch> don't work in .htaccess files.

Is there a fairly simple way to accomplish this?


Solution

  • I don't know of a way to solve this with Apache rules alone: it would require matching part of the path with a regex and reusing the captured result in another directive, which, as far as I know, isn't possible here.

    However, it's pretty simple if you introduce a PHP script into the mix:

    RewriteEngine On
    RewriteCond %{REQUEST_URI} \.(jpg|png|pdf)$
    RewriteRule ^(.*)$ /canonical-header.php?path=$1 [L]
    

    Note that this would send requests for all jpg, png and pdf files to the script regardless of the folder name. If you want to include only specific folders, you could add another RewriteCond to accomplish that.

    Now the canonical-header.php script:

    <?php
    
    // Checking for the presence of the path variable in the query string allows us to easily 404 any requests that
    // come directly to this script, just to be safe.
    if (!empty($_GET['path'])) {
        // Be sure to add any new file types you want to handle here so the correct content-type header will be sent.
        $mimeTypes = array(
            'pdf' => 'application/pdf',
            'jpg' => 'image/jpeg',
            'png' => 'image/png',
        );
    
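        $path         = filter_input(INPUT_GET, 'path', FILTER_SANITIZE_URL);
        $extension    = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        $canonicalUrl = 'http://' . $_SERVER['HTTP_HOST'] . '/' . dirname($path);
        $type         = isset($mimeTypes[$extension]) ? $mimeTypes[$extension] : null;
    
        // realpath() resolves any "../" segments. Treat unknown file types, and anything that
        // resolves outside this script's directory, as not found so arbitrary files cannot be read.
        $file = realpath(__DIR__ . '/' . $path);
        if ($type === null || $file === false || strpos($file, __DIR__ . DIRECTORY_SEPARATOR) !== 0) {
            $file = false;
        }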
    
        // Verify that the file exists and is readable, or send 404
        if (is_readable($file)) {
            header('Content-Type: ' . $type);
            header('Link: <' . $canonicalUrl . '>; rel="canonical"');
            readfile($file);
        } else {
            header('HTTP/1.0 404 Not Found');
            echo "File not found";
        }
    } else {
        header('HTTP/1.0 404 Not Found');
        echo "File not found";
    }
    

    Please consider this code untested and check that it works as expected across browsers before releasing it to production.
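
The script above builds the canonical URL from the file's own directory, whereas the example in the question maps /docs/folder1/file.pdf to /products/folder1/. One way to get that mapping is a small lookup from top-level folder to canonical URL prefix. The following is only a sketch: the function name canonicalUrlFor and the $folderMap array are hypothetical, and only the docs => products pair comes from the question; any other folders would need their own entries.

```php
<?php

// Hypothetical map from top-level file folders to the URL prefix of their canonical pages.
// Only the docs => products pair is taken from the question's example.
$folderMap = array(
    'docs' => 'products',
);

/**
 * Build the canonical URL for a requested file path such as "docs/folder1/file.pdf".
 * Returns null when the file's top-level folder has no entry in the map.
 */
function canonicalUrlFor($path, $host, array $folderMap)
{
    // Split the directory part of the path into segments, e.g. array("docs", "folder1").
    $segments = array_values(array_filter(explode('/', dirname($path)), 'strlen'));

    if (empty($segments) || !isset($folderMap[$segments[0]])) {
        return null;
    }

    // Swap the top-level folder for its canonical counterpart and keep the sub-folders.
    $segments[0] = $folderMap[$segments[0]];

    return 'http://' . $host . '/' . implode('/', $segments) . '/';
}

// Usage: prints http://www.example.com/products/folder1/
echo canonicalUrlFor('docs/folder1/file.pdf', 'www.example.com', $folderMap), "\n";
```

In canonical-header.php, the result could replace the $canonicalUrl assignment, with the Link header sent only when the function returns a non-null value.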