I’m trying to scrape a website using Scrapy that has some kind of bot protection. HTTP requests need to be made with a certain combination of headers; otherwise they time out or are refused with 403 Forbidden error codes. What options do I have to set in Scrapy that correspond to the --compressed flag of curl?
This request succeeds (at the time of writing):
curl 'https://www.douglas.es/api/v2/stores?fields=FULL&pageSize=1000&sort=asc' \
--compressed \
-H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8' \
-H 'Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3'
The same command without the --compressed option is denied with “HTTP Error 400. The size of the request headers is too long.”
Opening the URL in a browser works as well.
How can I make the same request as the cURL command in scrapy?
I’ve tried enabling COMPRESSION_ENABLED, but the requests get denied with a 403 error:
from scrapy import Request, Spider


class DouglasSpider(Spider):
    name = "douglas"
    # Desktop Firefox User-Agent, matching the curl command above.
    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0"
    custom_settings = {"ROBOTSTXT_OBEY": False, "COMPRESSION_ENABLED": True}

    def start_requests(self):
        yield Request(
            "https://www.douglas.es/api/v2/stores?fields=FULL&pageSize=1000&sort=asc",
            headers={
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3",
            },
        )

    def parse(self, response):
        # Never reached while the request is refused with 403.
        self.log(response.text)
I’ve also tried switching IPs, using a VPN, and clearing Scrapy’s HTTP cache, none of which seemed to make a difference.
curl --compressed just sets an Accept-Encoding header listing the compressors that your specific curl build supports. Scrapy has similar behavior by default, because HttpCompressionMiddleware is enabled by default; you don't need to do anything, except maybe install brotli and zstandard to get more supported compressors.
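You can verify what Scrapy actually sends. A minimal sketch, assuming httpbin.org is reachable (the service simply echoes the request headers it received back as JSON):

import json

from scrapy import Spider


class HeaderEchoSpider(Spider):
    name = "header_echo"
    start_urls = ["https://httpbin.org/headers"]

    def parse(self, response):
        # httpbin.org/headers returns the received request headers as JSON.
        headers = json.loads(response.text)["headers"]
        # With brotli and zstandard installed, this should list br and zstd
        # in addition to gzip and deflate.
        self.log(f"Accept-Encoding sent: {headers.get('Accept-Encoding')}")

Run it with scrapy runspider and compare the logged value against what curl --compressed -v prints for your curl.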
(Note that this is unlikely to solve your original problem, as its cause is different, but it answers the question you asked; "How can I make the same request as the cURL command in scrapy?" is a much wider question, for which the general answer is "you can't".)
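That said, if you just want to translate a curl command into a Scrapy request mechanically, Request.from_curl() gets you part of the way: it parses the URL, method and -H headers, and by default skips options it does not recognize (which, I believe, includes --compressed) with a warning. A sketch using the command from the question:

from scrapy import Request, Spider

CURL_COMMAND = (
    "curl 'https://www.douglas.es/api/v2/stores?fields=FULL&pageSize=1000&sort=asc' "
    "-H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0' "
    "-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8' "
    "-H 'Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3'"
)


class DouglasCurlSpider(Spider):
    name = "douglas_curl"
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def start_requests(self):
        # Builds a Request with the same URL and headers as the curl command;
        # compression negotiation is left to HttpCompressionMiddleware.
        yield Request.from_curl(CURL_COMMAND)

    def parse(self, response):
        self.log(f"status: {response.status}")

This reproduces the headers, not the whole request: things like header order or the TLS fingerprint, which some bot protections inspect, are still Scrapy's own.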