Just looking for ideas on how to deal with scraping bots causing DDoS-style traffic surges.
This is a Drupal site with over 100k pages. There are around 1000 term pages which get updated quite often, as node updates trigger a Varnish purge on those pages.
It would be fine under normal site operation, but we get hit by scraping bots that want to crawl those pages as quickly as possible: sometimes 40-50 IPs, each hitting 20-30 URLs, within the span of a minute or two. The IPs and servers change often, so it is not possible to keep blocking them.
Currently we have a TTL of 1 day on the path (a simplified sketch of that configuration is below), and the pages get updated 4-5 times a day, which means we often get caught off guard by these bots hitting 600-700 of these uncached pages and bringing the backend to a halt.
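Roughly how that TTL is applied; the backend address and path pattern are illustrative, not our real ones:

vcl 4.1;

backend default {
    .host = "192.0.2.11";
    .port = "8080";
}

sub vcl_backend_response {
    # Term pages (illustrative path) get a 1-day TTL.
    if (bereq.url ~ "^/taxonomy/term/") {
        set beresp.ttl = 1d;
    }
}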
Things we have tried:
- A higher grace period for this particular path (sketched just after this list). This makes things worse: the bots get served expired pages really fast, so they request even faster, which puts even more load on the backend. Is there a way for Varnish to throttle backend fetches within the grace period?
- Request throttling. This doesn't work because they stay under the per-client limits by spreading the requests across many IPs.
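For context, the grace experiment looked roughly like this; the path and durations are illustrative:

vcl 4.1;

backend default {
    .host = "192.0.2.11";
    .port = "8080";
}

sub vcl_backend_response {
    if (bereq.url ~ "^/taxonomy/term/") {
        set beresp.ttl = 1d;
        # Keep serving stale copies for up to 6h past the TTL while
        # Varnish refreshes the object in the background.
        set beresp.grace = 6h;
    }
}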
What else can be done to shield the backend from sudden surges? We have been advised not to purge the cache on every update, but that defeats the whole point of running the site if we can't update content for fear of bots.
You could compile vmod_vsthrottle from source and throttle clients based on their client IP or certain headers.
Here's an example VCL snippet based on that VMOD:
vcl 4.1;

import vsthrottle;

backend default {
    .host = "192.0.2.11";
    .port = "8080";
}

sub vcl_recv {
    if (vsthrottle.is_denied(client.identity, 15, 10s, 30s)) {
        # Client has exceeded 15 requests per 10s.
        # When this happens, block it altogether for the next 30s.
        return (synth(429, "Too Many Requests"));
    }
}
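Note that client.identity defaults to the client IP. Since your scrapers rotate IPs, you could key the counter on something they share instead, for example the User-Agent string. A minimal sketch, with illustrative limits:

vcl 4.1;

import vsthrottle;

backend default {
    .host = "192.0.2.11";
    .port = "8080";
}

sub vcl_recv {
    # Key the token bucket on the User-Agent instead of the IP, so a
    # botnet that rotates addresses but reuses its UA string shares
    # one budget. Requests without a User-Agent collapse into a
    # single bucket, which is usually what you want for scrapers.
    if (vsthrottle.is_denied("ua:" + req.http.User-Agent, 50, 10s, 60s)) {
        return (synth(429, "Too Many Requests"));
    }
}

The obvious trade-off is that a bot which randomizes its User-Agent slips through, so in practice you would combine several keys (IP, UA, a cookie, etc.) rather than rely on one.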
Varnish Enterprise has a Veribot feature that checks the User-Agent header and performs reverse DNS resolution on the client IP to verify that a client really is the bot it claims to be. This allows you to maintain lists of allowed and rejected bots whose verification goes beyond a basic User-Agent check (which is easy to fake). The output of the Veribot feature can then be used to allow access to verified bots and block the rest.
While Veribot is capable of identifying trusted bots (like the Google crawler), even non-malicious bots can have a detrimental impact on the performance of your application when cache misses occur.
Use Veribot to block access for unwanted bots, and then still rate limit the requests that do come through, both from bots and from regular users.
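If you stay on the open-source side, you can approximate part of that idea by allowlisting crawler networks you have verified out-of-band (for example via reverse DNS) and giving self-declared bots outside that list a much stricter throttle. The ACL range, regex, and limits below are purely illustrative, not real crawler ranges:

vcl 4.1;

import vsthrottle;

backend default {
    .host = "192.0.2.11";
    .port = "8080";
}

# Illustrative allowlist of crawler networks verified out-of-band;
# replace with ranges you have actually confirmed.
acl trusted_bots {
    "192.0.2.0"/24;
}

sub vcl_recv {
    if (req.http.User-Agent ~ "(?i)(bot|crawler|spider)"
        && !(client.ip ~ trusted_bots)) {
        # Self-declared bots outside the allowlist get a small budget:
        # 5 requests per 10s, then a 120s block.
        if (vsthrottle.is_denied("bot:" + client.identity, 5, 10s, 120s)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}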
vmod_vsthrottle is open source. You can download it from https://github.com/varnish/varnish-modules and compile the source code. Veribot is part of Varnish Enterprise, which is a commercial product; we also package vmod_vsthrottle in our commercial software.
If you want to know more, just reach out to us: https://www.varnish-software.com/contact-us/. There are other features in Varnish Enterprise that would benefit your Drupal setup.