I'm writing a simple Perl script that fetches some pages from different sites. It's very non-intrusive; I don't hog a server's bandwidth. It retrieves a single page without loading any extra JavaScript, images, or style sheets.
I use LWP::UserAgent to retrieve the pages. This works fine on most sites, but some sites return a "403 Forbidden" error. The same pages load perfectly fine in my browser. I have inspected the request headers my web browser sends and copied them exactly when trying to retrieve the same page in Perl, and every single time I get a 403 error. Here's a code snippet:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $URL = "https://www.betsson.com/en/casino/jackpots";

my $browserObj = LWP::UserAgent->new(
    ssl_opts => { verify_hostname => 0 }
);

my $cookie_jar = HTTP::Cookies->new();
$browserObj->cookie_jar( $cookie_jar );

$browserObj->agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0");
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';

my @header = (
    'Accept'                    => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding'           => 'gzip, deflate, br',
    'Accept-Language'           => 'en-US,en;q=0.5',
    'Connection'                => 'keep-alive',
    'DNT'                       => '1',
    'Host'                      => 'www.bettson.com',
    'Upgrade-Insecure-Requests' => '1',
);
my $response = $browserObj->get( $URL, @header );

if ( $response->is_success ) {
    print "Success!\n";
} else {
    print "Unsuccessful...\n";
}
How do these servers distinguish between a real browser and my script? At first I thought they had some JavaScript trickery going on, but then I realized that for that to work, the page has to be loaded by a browser first. Yet I immediately get the 403 error.
What can I do to debug this?
While a 403 is a typical response from bot detection, in this case bot detection is not the cause of the problem. The cause is a typo in your code:
my $URL = "https://www.betsson.com/en/casino/jackpots";
...
'Host' => 'www.bettson.com',
In the URL the domain name is www.betsson.com, and this should be reflected in the Host header. But your Host header is slightly different: www.bettson.com. Since the Host header has the wrong name, the request is rejected with 403 Forbidden.
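One way to catch a mismatch like this yourself is to print the request that was actually sent and the full status line instead of a bare success flag. A minimal debugging sketch, reusing the $browserObj and @header from the question ($response->request returns the request that produced the response, after any redirects):

my $response = $browserObj->get( $URL, @header );
print $response->request->as_string;    # the request line and headers actually sent
print $response->status_line, "\n";     # e.g. "403 Forbidden"

Comparing that output against what the browser sends makes a misspelled header value stand out immediately.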
And actually, it is not even necessary to go through all this trouble, since it looks like no bot detection is done at all. In other words, there is no need to set the user agent or fiddle with the headers; a plain request works:
my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
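Put together, a minimal self-contained version looks like this (decoded_content is the standard HTTP::Message accessor that returns the body decompressed and decoded; printing status_line on failure shows the exact error):

use strict;
use warnings;
use LWP::UserAgent;

my $URL = "https://www.betsson.com/en/casino/jackpots";

my $browserObj = LWP::UserAgent->new();
my $response   = $browserObj->get($URL);

if ( $response->is_success ) {
    print $response->decoded_content;               # the page body
} else {
    print "Failed: ", $response->status_line, "\n"; # e.g. "403 Forbidden"
}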