If I visit https://doi.org/10.1002/jccs.200600142
with my browser, everything is fine. But both a HEAD and a GET request from Python's requests library fail:
python -c "import requests; print(requests.head('https://doi.org/10.1002/jccs.200600142', allow_redirects=True))"
<Response [403]>
I also tried using a session (so cookies are kept) and changing the user-agent, but that did not help either:
import requests
with requests.Session() as s:
    print(s.get('https://doi.org/10.1002/jccs.200600142', allow_redirects=True, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}))
<Response [403]>
Does anyone know what requests
does differently from Firefox? Or should I include more headers?
I had the same problem with accessing doi.org URLs.
I eventually discovered that additional HTTP headers are needed to stop the server from rejecting the request. Specifically, all three of Accept-Language, Sec-Fetch-Site, and User-Agent must be present, or the server responds with a 403 status code.
import requests

# All three headers are required; omit any one of them and doi.org answers 403.
h = {
    "Accept-Language": "en-US",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:138.0) Gecko/20100101 Firefox/138.0",
}

url = "https://doi.org/10.1111/tgis.70037"
response = requests.get(url, headers=h)
response.status_code  # 200
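Applying the same fix to the URL from the question, here is a minimal sketch (assuming the publisher behind that DOI accepts the same three headers) that sets them once on a requests.Session so they are sent on every request the session makes, including the redirect hops:

import requests

headers = {
    "Accept-Language": "en-US",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:138.0) Gecko/20100101 Firefox/138.0",
}

with requests.Session() as s:
    s.headers.update(headers)  # session-level headers apply to every request
    r = s.get("https://doi.org/10.1002/jccs.200600142", allow_redirects=True)
    print(r.status_code)  # expected 200 if the target site accepts these headers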