python web-scraping python-requests

Problem scraping Amazon using requests: I get blocked even when using cookies and headers. I can only scrape using a browser. Any solution?


The requests module no longer works for me when I try to scrape Amazon. I've tried using cookies, setting headers, and changing IPs, but nothing really works other than scraping through a browser. Does anyone know how they're able to do it, and whether there's a good workaround using requests?

The really odd thing is that the request returns the page when sent through cURL, but when I turn it into Python code it returns a captcha page that I never see in my browser and that doesn't go away even with cookies.

For example, this cURL request returns the Amazon main page, but when turned into Python it returns a captcha request:

curl -L -vvv http://amazon.com -H "User-Agent:Mozilla 5.0"

This is my current code. I copied the cURL request directly from the browser and turned it into Python code, but it still doesn't work:

import requests

cookies = {
    'session-id': '135-4585428-6195300',
    'session-id-time': '2082787201l',
    'i18n-prefs': 'USD',
    'sp-cdn': '"L5Z9:IL"',
    'ubid-main': '132-1503580-7678418',
    'session-token': 'R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L',
    'csm-hit': 'tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
}

headers = {
    'authority': 'www.amazon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # 'cookie': 'session-id=135-4585428-6195300; session-id-time=2082787201l; i18n-prefs=USD; sp-cdn="L5Z9:IL"; ubid-main=132-1503580-7678418; session-token=R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L; csm-hit=tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
    'device-memory': '8',
    'downlink': '10',
    'dpr': '1',
    'ect': '4g',
    'rtt': '100',
    'sec-ch-device-memory': '8',
    'sec-ch-dpr': '1',
    'sec-ch-ua': '"Chromium";v="116", "Not)A;Brand";v="24", "Microsoft Edge";v="116"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-ch-viewport-width': '1037',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.54',
    'viewport-width': '1037',
}

response = requests.get('https://www.amazon.com/dp/B002G9UDYG', cookies=cookies, headers=headers)

Solution

  • I don't think you can scrape Amazon with Python Requests, even if you reuse information extracted from a valid browser session.

    Basic curl connection

    curl -I http://www.amazon.com

    The response below shows that the URL is served through Amazon CloudFront and returns a status code of 301, which tells us that the URL is being permanently redirected to another URL (here, the HTTPS address in the Location header).

    HTTP/1.1 301 Moved Permanently
    Server: CloudFront
    Date: Wed, 30 Aug 2023 12:41:38 GMT
    Content-Type: text/html
    Content-Length: 167
    Connection: keep-alive
    Location: https://www.amazon.com/
    X-Cache: Redirect from cloudfront
    Via: 1.1 322b7a8ce3aa88236c8ca9410d0b9300.cloudfront.net (CloudFront)
    X-Amz-Cf-Pop: ATL58-P3
    Alt-Svc: h3=":443"; ma=86400
    X-Amz-Cf-Id: oK3dFCUCiQ6ZdAe_BEC5p-XbRxcrXFiYupSaYQOh6W1JS85BJsLrKA==
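
The same check can be made from Python. The sketch below is only an illustration and not part of the original answer: it sends a HEAD request without following redirects (roughly what curl -I does) and reports where a redirect points. The helper names redirect_target and head_without_redirects are made up for this example.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def redirect_target(status_code, headers):
    """Return the Location header if the status is a redirect, else None."""
    if status_code in REDIRECT_CODES:
        return headers.get("Location")
    return None


def head_without_redirects(url, timeout=10):
    """Send a HEAD request and return (status_code, headers) without following redirects."""
    # Imported here so redirect_target can be used without the third-party package.
    import requests

    response = requests.head(url, allow_redirects=False, timeout=timeout)
    return response.status_code, response.headers


# Example usage (performs a real network request):
# status, headers = head_without_redirects("http://www.amazon.com")
# print(status, redirect_target(status, headers))
```

With the headers shown above, redirect_target(301, headers) would give back the value of the Location header.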
    

    Python Requests

    import requests
    response = requests.get('https://www.amazon.com/')
    
    print(response.status_code)
    # Output: 503
    
    print(response.headers)
    # Output: {'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Wed, 30 Aug 2023 12:52:08 GMT', 'x-amz-rid': 'YXG2PT1GB1T7GY4Q19KC', 'Vary': 'Content-Type,Accept-Encoding,User-Agent', 'Last-Modified': 'Mon, 12 Jun 2023 22:17:25 GMT', 'ETag': '"a6f-5fdf615518740-gzip"', 'Accept-Ranges': 'bytes', 'Content-Encoding': 'gzip', 'Strict-Transport-Security': 'max-age=47474747; includeSubDomains; preload', 'X-Cache': 'Error from cloudfront', 'Via': '1.1 71cf657de17d1d4de9dbcb4ff38d54c0.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'ATL56-P1', 'Alt-Svc': 'h3=":443"; ma=86400', 'X-Amz-Cf-Id': 'Rxe_ROuUee2QLLxW7e8tVqbJ4WwRK3JXbhjxrgV-WXwrb0q6pdzdbg=='}
    

    The status code 503 (Service Unavailable) indicates that the server is temporarily unable to handle the request. The X-Cache header reads "Error from cloudfront", which shows that Amazon CloudFront is not allowing the connection.

    If we examine the content of the page (response.text), we see this:

    To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv
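
A scraper can look for these signs before it tries to parse anything. The following is a rough heuristic sketch, not an official detection method: the function name block_reasons is hypothetical, and the marker strings are simply copied from the response shown above, so they are assumptions that may stop matching if Amazon changes its block page.

```python
# Marker strings taken from the block page quoted above; Amazon may change them.
BLOCK_MARKERS = (
    "To discuss automated access to Amazon data",
    "api-services-support@amazon.com",
)


def block_reasons(status_code, headers, body):
    """Return a list of hints suggesting that the response is a block page."""
    reasons = []
    # Normalise header names so a plain dict works as well as response.headers.
    lowered = {name.lower(): value for name, value in headers.items()}

    if status_code == 503:
        reasons.append("status code 503")
    if "error" in lowered.get("x-cache", "").lower():
        reasons.append("X-Cache: " + lowered["x-cache"])
    if any(marker in body for marker in BLOCK_MARKERS):
        reasons.append("automated-access notice in body")
    return reasons
```

With a real response this would be called as block_reasons(response.status_code, response.headers, response.text). An empty list does not prove the page is genuine, since a captcha page may arrive with a different status code and different text.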

    Based on this information, Amazon is trying to prevent people from scraping its site with tools such as Python Requests. I would recommend trying Selenium or Amazon's API.
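
As a starting point for the Selenium route, here is a minimal sketch. It assumes Selenium 4 and a local Chrome installation that supports the --headless=new flag; the function name fetch_page_source is made up for this example, and there is no guarantee that Amazon will serve the real page to an automated browser either.

```python
def fetch_page_source(url, headless=True):
    """Load url in Chrome through Selenium and return the rendered HTML."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # Assumption: a Chrome version that understands the newer headless mode.
        options.add_argument("--headless=new")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


# Example usage (starts a real browser and performs a network request):
# html = fetch_page_source("https://www.amazon.com/dp/B002G9UDYG")
# print(len(html))
```

The try/finally block matters in practice: without driver.quit(), every failed run leaves a Chrome process behind.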

    Here are some sites that highlight how to use selenium to scrape Amazon: