I have an online archive of images, some of which reside on Cloud Storage. The archive is hierarchical with four levels, and the appropriate level is accessed using query strings:
a.php?level=image&collection=a&document=b&item=72
The level can be archive, collection, document, or image.
I want to prevent robots from accessing the actual images, primarily to minimise traffic on the cloud storage. The idea is that if a robot issues a request where the query string level parameter is image ("?level=image"), that request is diverted.
The .htaccess code below is intended to check the query string for a request from a foreign referrer, and if the request is for an image, direct the request elsewhere:
RewriteEngine On
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1
RewriteCond %{QUERY_STRING} ^level=image$
RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
My code appears to have no obvious effect. Can anybody see what I am doing wrong? I do not pretend to have a lot of confidence with .htaccess code, normally relying on snippets produced by people cleverer than me.
RewriteCond %{QUERY_STRING} ^level=image$
This checks that the query string is exactly equal to level=image, whereas in your example the level URL parameter is just one of many (the first one).
To check that the URL parameter level=image appears anywhere in the query string, modify the above condition like so:
RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
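The difference between the two patterns can be checked outside Apache. The following is a minimal sketch using Python's re module, on the assumption that patterns this simple behave the same way there as in Apache's regex engine. The helper name matches is hypothetical and exists only for illustration. Note that Apache's QUERY_STRING does not include the leading "?".

```python
import re

# The query string from the question (no leading "?", as in Apache's QUERY_STRING).
query = "level=image&collection=a&document=b&item=72"

ORIGINAL = r"^level=image$"         # succeeds only if level=image is the entire query string
REVISED = r"(^|&)level=image($|&)"  # succeeds if level=image is a complete parameter anywhere


def matches(pattern: str, qs: str) -> bool:
    """Hypothetical helper: unanchored regex search, as RewriteCond performs."""
    return re.search(pattern, qs) is not None


print(matches(ORIGINAL, query))                      # original condition fails here
print(matches(REVISED, query))                       # revised condition succeeds
print(matches(REVISED, "collection=a&level=image"))  # parameter order does not matter
print(matches(REVISED, "level=images"))              # a longer value is not matched
```

The (^|&) and ($|&) alternations are what stop the pattern from matching a longer parameter name or value that merely contains level=image.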
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1
Minor issue, but this would allow referrers where the requested hostname (e.g. example.com) occurs only as a subdomain of the referrer, e.g. example.com.referrer.com. To resolve this, modify the CondPattern to include a trailing slash or end-of-string anchor. For example:
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1(/|$)
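The effect of the extra (/|$) can be illustrated with a short sketch, again in Python's re module and again assuming the backreference behaves as it does in Apache. The function name is_same_site and the sample referrer URLs are hypothetical. Keep in mind that the CondPattern in the rule is negated with "!", so a match here means the rule does not fire.

```python
import re

LOOSE = r"^([^@]*)@@https?://\1"        # original CondPattern (without the "!")
STRICT = r"^([^@]*)@@https?://\1(/|$)"  # with trailing slash or end-of-string


def is_same_site(pattern: str, host: str, referer: str) -> bool:
    """Hypothetical helper: build HTTP_HOST@@HTTP_REFERER and test the pattern."""
    return re.search(pattern, f"{host}@@{referer}") is not None


host = "example.com"
lookalike = "https://example.com.referrer.com/page"  # foreign site, similar name

print(is_same_site(LOOSE, host, lookalike))                      # wrongly treated as same site
print(is_same_site(STRICT, host, lookalike))                     # correctly treated as foreign
print(is_same_site(STRICT, host, "https://example.com/gallery")) # genuine same-site referrer
print(is_same_site(STRICT, host, "https://example.com"))         # referrer with no path
```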
RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
There's no need for the capturing subpattern. If you only need the rule to be successful for any URL-path then use just ^ to avoid traversing the URL-path. But in your example, the request is for a.php, not "any URL"?
But why "redirect", rather than simply block the request? As you say, this is for "robots" after all. For example, to send a 403 Forbidden:
RewriteRule ^a\.php$ - [F]
Bringing the above points together:
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1(/|$)
RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
RewriteRule ^a\.php$ - [F]
Note, however, that search engine "bots" generally don't send a Referer header at all. And it is trivial for arbitrary bots to fake the Referer header and circumvent your block.
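To see how the three directives interact for different Referer values, here is a rough simulation in Python. It is only a sketch under stated assumptions: the function is_forbidden is hypothetical, the host a.co.uk is taken from the redirect URL in the question and assumed to be the site's own hostname, and the .htaccess file is assumed to sit in the same directory as a.php so that the URL-path seen by the rule is a.php without a leading slash.

```python
import re

SAME_SITE = re.compile(r"^([^@]*)@@https?://\1(/|$)")  # negated in the RewriteCond
LEVEL_IMAGE = re.compile(r"(^|&)level=image($|&)")
URL_PATH = re.compile(r"^a\.php$")


def is_forbidden(host: str, referer: str, path: str, query: str) -> bool:
    """Hypothetical model of the rule: 403 when all three tests succeed."""
    foreign = SAME_SITE.search(f"{host}@@{referer}") is None
    wants_image = LEVEL_IMAGE.search(query) is not None
    return URL_PATH.search(path) is not None and foreign and wants_image


host = "a.co.uk"  # assumed site hostname
image_qs = "level=image&collection=a&document=b&item=72"

# An empty Referer fails the same-site test, so the request is refused.
print(is_forbidden(host, "", "a.php", image_qs))
# A Referer from the site itself, genuine or faked, is let through.
print(is_forbidden(host, "https://a.co.uk/a.php?level=document", "a.php", image_qs))
# Other levels are never refused, whatever the Referer.
print(is_forbidden(host, "", "a.php", "level=document&collection=a&document=b"))
```

One consequence worth noting: because an empty Referer counts as foreign, a person who opens an image URL directly (from a bookmark or a typed address) would be refused in the same way as a bot that sends no Referer.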