I'm using the following IIS Rewrite Rule to block as many bots as possible.
<rule name="BotBlock" stopProcessing="true">
  <match url=".*" />
  <conditions>
    <add input="{HTTP_USER_AGENT}" pattern="^$|\b(?!.*googlebot.*\b)\w*(?:bot|crawl|spider)\w*" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>
The goal is to block all user agents containing bot, crawl or spider, but still allow Googlebot. This works to an extent. The problem is that the second part of the regex is still triggered even when "googlebot" is found in the string.
Below are some examples of what I mean:
Googlebot/2.1 (+http://www.google.com)
Works fine, the 'bot' part in googlebot is ignored and the request is permitted.
Googlebot/2.1 (+http://www.google.com/bot.html)
Does not work: the regex still triggers on the second 'bot' in the string and the request is blocked.
KHTML, like Gecko; compatible; bingbot
Works fine: the regex triggers on the 'bot' in bingbot and the request is blocked.
So can someone help me change the regex so that the string Googlebot/2.1 (+http://www.google.com/bot.html) is allowed?
I'm not familiar with IIS's exact regex flavor (presumably .NET) but this should work if you can enable case-insensitive regex'ing:
^(?!.*googlebot).*(?:bot|crawl|spider)
Explanation:
^
- start-of-line anchor
(?!.*googlebot)
- ahead of me, the word "googlebot" does not exist
.*(?:bot|crawl|spider)
- capture everything leading up to a positive match of the word "bot", "crawl", or "spider"
The combination of negative look-ahead and positive forward capturing produces an implicit "and" condition in regex; both must be true in order for the regex to register a match.
https://regex101.com/r/ri6Qs7/1
To note: I am not sure why your regex starts with ^$| unless you are purposely looking to return a 403 for requests with an empty user agent.
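For reference, here is a rough sketch of how the pattern could be dropped into the rule from the question. It is untested against a live IIS instance and rests on two assumptions: that the condition is evaluated case-insensitively (ignoreCase is set explicitly here to make that visible), and that blocking empty user agents is intended, which is why the ^$| alternative is kept. Remove that alternative if it is not wanted.

```xml
<rule name="BotBlock" stopProcessing="true">
  <match url=".*" />
  <conditions>
    <!-- Assumption: ignoreCase="true" so that "Googlebot" and "googlebot" are treated alike. -->
    <!-- "^$" matches an empty user agent; the second alternative matches bot, crawl or spider
         only when "googlebot" does not appear anywhere in the string. -->
    <add input="{HTTP_USER_AGENT}" pattern="^$|^(?!.*googlebot).*(?:bot|crawl|spider)" ignoreCase="true" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>
```

With case-insensitive matching, both Googlebot examples from the question should fail the condition and be let through, while the bingbot example should still match and receive the 403.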