
.htaccess redirects according to the query string contents


I have an online archive of images, some of which reside on Cloud Storage. The archive is hierarchical with four levels, and the appropriate level is accessed using query strings:

a.php?level=image&collection=a&document=b&item=72

The level can be archive, collection, document, or image.

I want to prevent robots from accessing the actual images, primarily to minimise traffic on the cloud storage. The idea is that any request whose query string has the level parameter set to image ("?level=image") is diverted.

The .htaccess code below is intended to check the query string for a request from a foreign referrer, and if the request is for an image, direct the request elsewhere:

  RewriteEngine On
  RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1
  RewriteCond %{QUERY_STRING} ^level=image$
  RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]

My code appears to have no obvious effect. Can anybody see what I am doing wrong? I do not pretend to have a lot of confidence with .htaccess code, normally relying on snippets produced by people cleverer than me.


Solution

  • RewriteCond %{QUERY_STRING} ^level=image$
    

    This checks that the query string is exactly equal to level=image, whereas in your example level is only the first of several URL parameters, so the condition never matches.

    To check that the URL parameter level=image appears anywhere in the query string, modify the condition like so:

    RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
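
    As a rough illustration of what the modified condition accepts, the annotated fragment below lists a few query strings and whether the pattern would match them. Apart from the first, which is the example from the question, the query strings are made up for illustration.

    ```apache
    # Matches:  level=image&collection=a&document=b&item=72  (parameter comes first)
    # Matches:  collection=a&level=image                     (parameter comes last)
    # No match: level=images                                 (value only starts with "image")
    # No match: xlevel=image                                 (different parameter name)
    RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
    ```

    The (^|&) and ($|&) groups anchor the name and value to parameter boundaries, which is what rules out the last two cases.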
    
  • RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1
    

    A minor issue: this would also allow a referrer whose hostname merely begins with the requested hostname, e.g. when the request is for example.com and the referrer is example.com.referrer.com. To resolve this, extend the CondPattern so that the hostname must be followed by a slash or the end of the string. For example:

    RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1(/|$)
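
    To make the effect of the anchored pattern concrete, the fragment below traces a few test strings through the condition. It assumes the request arrives with Host: example.com; the referrer values are hypothetical.

    ```apache
    # TestString is "<Host>@@<Referer>". The leading "!" negates the match,
    # so the condition succeeds (and the rule applies) when the pattern does NOT match.
    #
    # example.com@@https://example.com/a.php?level=document -> pattern matches, request is let through
    # example.com@@https://example.com                      -> pattern matches, request is let through
    # example.com@@https://example.com.referrer.com/        -> no match, rule applies
    # example.com@@                                         -> no match (empty Referer), rule applies
    RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1(/|$)
    ```

    The backreference \1 repeats whatever ([^@]*) captured from the Host header, so the comparison works without hard-coding the hostname.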
    
  • RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
    

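    There's no need for the capturing subpattern, since the captured URL-path is never used. If the rule should apply to any URL-path, use just ^, which succeeds without traversing the URL-path. The NC flag is likewise redundant with a pattern that matches everything. In your example, however, the request is for a.php, not "any URL", so the pattern can target that file specifically.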
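
    If a redirect is still what you want, a minimal sketch might look like the following. It assumes Apache 2.4 or later (for the QSD flag) and that the .htaccess file sits in the same directory as a.php; neither is stated in the question.

    ```apache
    RewriteEngine On

    # Referer is missing or does not point at the requested host
    RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1(/|$)
    # level=image appears as a complete parameter anywhere in the query string
    RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
    # Redirect a.php only; QSD drops the original query string from the target
    RewriteRule ^a\.php$ https://a.co.uk/blank.htm [QSD,R=302,L]
    ```

    By default mod_rewrite carries the original query string over to the redirect target. With a catch-all pattern such as (.*) or ^, the follow-up request for blank.htm?level=image... could then match the same rule again, which is one more reason to restrict the pattern and discard the query string.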

    But why redirect, rather than simply block the request? This is for robots, after all. For example, to send a 403 Forbidden:

    RewriteRule ^a\.php$ - [F]
    

    In summary:

    RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1(/|$)
    RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
    RewriteRule ^a\.php$ - [F]
    

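    Note, however, that search engine bots generally don't send a Referer header at all. These rules treat a request without a Referer as foreign and block it, which also affects visitors who follow a bookmark or whose browser suppresses the header. It is also trivial for arbitrary bots to fake the Referer header and circumvent the block.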