python, web-scraping

How does the website know that it's not my browser?


When I access the URL https://www.getfpv.com/media/sitemap.xml from my browser it works, but when I try to do it with Python, it returns a 403 Forbidden. How does the website know that it's Python making the request instead of my browser? I copied all of the headers, so the request should be identical. It's not JavaScript or cookies, because when I turned those off in Safari it still worked. My code is below.

import requests

url = "https://www.getfpv.com/media/sitemap.xml"
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Priority': 'u=0, i',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
}
r = requests.get(url, headers=headers)
r
# returns:
<Response [403]>

Solution

  • Looks like it is filtering on the User-Agent header. I am able to get it via:

    import requests
    
    sess = requests.Session()
    sess.headers['User-Agent'] = (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) '
        'Gecko/20100101 Firefox/135.0'
    )
    res = sess.get('https://www.getfpv.com/media/sitemap.xml')
    res
    # returns:
    <Response [200]>
    
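    As a side note (a sketch, not specific to this site's filtering rules): you can inspect the headers `requests` would actually transmit without sending anything over the network, by preparing the request through a `Session`. This makes it easy to see that, unless you override it, `requests` announces itself with its own User-Agent, which many servers block outright.

    ```python
    import requests

    # Build the request but don't send it; preparing it through a
    # Session merges in the library's default headers.
    req = requests.Request('GET', 'https://www.getfpv.com/media/sitemap.xml')
    prepared = requests.Session().prepare_request(req)

    # By default this is something like "python-requests/2.x.y" --
    # an easy signal for a server to filter on.
    print(prepared.headers['User-Agent'])
    ```

    Comparing `prepared.headers` against what your browser sends (visible in its developer tools) is a quick way to spot which header is tripping the filter.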


    Behaviour also seems to vary depending on the client's origin, so the same request may not succeed from every network.
