The requests module no longer works for me when I try to scrape Amazon. I've tried cookies, custom headers, and changing IPs, but nothing works other than scraping through a browser. Does anyone know how they're able to detect it, and whether there's a good workaround using requests?
The really odd thing is that the request returns the page when sent through cURL, but when I convert it to Python code it returns a captcha request that I don't see in my browser and that doesn't go away even with cookies.
For example, this cURL request returns the Amazon main page, but when turned into Python it returns a captcha request:
curl -L -vvv http://amazon.com -H "User-Agent:Mozilla 5.0"
This is my current code. I copied the request from the browser as cURL and converted it to Python code, but it still doesn't work:
import requests
cookies = {
'session-id': '135-4585428-6195300',
'session-id-time': '2082787201l',
'i18n-prefs': 'USD',
'sp-cdn': '"L5Z9:IL"',
'ubid-main': '132-1503580-7678418',
'session-token': 'R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L',
'csm-hit': 'tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
}
headers = {
'authority': 'www.amazon.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
# 'cookie': 'session-id=135-4585428-6195300; session-id-time=2082787201l; i18n-prefs=USD; sp-cdn="L5Z9:IL"; ubid-main=132-1503580-7678418; session-token=R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L; csm-hit=tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
'device-memory': '8',
'downlink': '10',
'dpr': '1',
'ect': '4g',
'rtt': '100',
'sec-ch-device-memory': '8',
'sec-ch-dpr': '1',
'sec-ch-ua': '"Chromium";v="116", "Not)A;Brand";v="24", "Microsoft Edge";v="116"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-ch-ua-platform-version': '"10.0.0"',
'sec-ch-viewport-width': '1037',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.54',
'viewport-width': '1037',
}
response = requests.get('https://www.amazon.com/dp/B002G9UDYG', cookies=cookies, headers=headers)
I don't think that you can scrape Amazon with Python Requests, even if you reuse information extracted from a valid browser session.
Basic curl connection
curl -I http://www.amazon.com
The response below shows that the URL is served through Amazon CloudFront and has a status code of 301, which tells us that the URL is being permanently redirected to another URL (here the HTTPS version of the site, as the Location header shows).
HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Wed, 30 Aug 2023 12:41:38 GMT
Content-Type: text/html
Content-Length: 167
Connection: keep-alive
Location: https://www.amazon.com/
X-Cache: Redirect from cloudfront
Via: 1.1 322b7a8ce3aa88236c8ca9410d0b9300.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: ATL58-P3
Alt-Svc: h3=":443"; ma=86400
X-Amz-Cf-Id: oK3dFCUCiQ6ZdAe_BEC5p-XbRxcrXFiYupSaYQOh6W1JS85BJsLrKA==
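To make the redirect explicit, here is a minimal sketch that reads the status code and the Location header out of a raw header dump like the one above. It uses only the standard library; the helper names `parse_head_response` and `redirect_target` are made up for illustration, and the sample text is a trimmed copy of the curl output.

```python
from http import HTTPStatus
from typing import Dict, Optional, Tuple

# A trimmed copy of the `curl -I` output shown above.
RAW_RESPONSE = """\
HTTP/1.1 301 Moved Permanently
Server: CloudFront
Content-Type: text/html
Location: https://www.amazon.com/
X-Cache: Redirect from cloudfront
"""


def parse_head_response(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split a raw header dump into (status_code, headers)."""
    lines = [line for line in raw.splitlines() if line.strip()]
    # The status line looks like "HTTP/1.1 301 Moved Permanently".
    status_code = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


def redirect_target(raw: str) -> Optional[str]:
    """Return the Location URL if the response is a 3xx redirect, else None."""
    status_code, headers = parse_head_response(raw)
    if 300 <= status_code < 400:
        return headers.get("location")
    return None


status, _ = parse_head_response(RAW_RESPONSE)
print(status, HTTPStatus(status).phrase)
print(redirect_target(RAW_RESPONSE))
```

This is the hop that `curl -L` follows automatically. Requests also follows redirects by default, so the 503 shown next comes from the HTTPS URL itself.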
Python Requests
import requests
response = requests.get('https://www.amazon.com/')
print(response.status_code)
503
print(response.headers)
{'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Wed, 30 Aug 2023 12:52:08 GMT', 'x-amz-rid': 'YXG2PT1GB1T7GY4Q19KC', 'Vary': 'Content-Type,Accept-Encoding,User-Agent', 'Last-Modified': 'Mon, 12 Jun 2023 22:17:25 GMT', 'ETag': '"a6f-5fdf615518740-gzip"', 'Accept-Ranges': 'bytes', 'Content-Encoding': 'gzip', 'Strict-Transport-Security': 'max-age=47474747; includeSubDomains; preload', 'X-Cache': 'Error from cloudfront', 'Via': '1.1 71cf657de17d1d4de9dbcb4ff38d54c0.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'ATL56-P1', 'Alt-Svc': 'h3=":443"; ma=86400', 'X-Amz-Cf-Id': 'Rxe_ROuUee2QLLxW7e8tVqbJ4WwRK3JXbhjxrgV-WXwrb0q6pdzdbg=='}
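The status code 503 indicates that the server is temporarily unable to handle the request. The X-Cache header ('Error from cloudfront') confirms that the request was answered with an error rather than with the page, which suggests that Amazon CloudFront is not allowing the connection.
If we examine the content of the page (response.text), we see this: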
To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv
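If you still want to experiment with Requests, it helps to recognise this block page in code instead of parsing it as if it were a product page. The following is only a heuristic sketch: `looks_blocked` and `BLOCK_MARKERS` are hypothetical names, and the marker strings are assumptions taken from the message above and from the captcha mentioned in the question, so they may stop matching whenever Amazon changes its pages.

```python
# Heuristic markers (assumptions): the first is taken from the block page
# quoted above, the second from the captcha described in the question.
BLOCK_MARKERS = (
    "to discuss automated access to amazon data",
    "captcha",
)


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if a response looks like an anti-bot page, not real content."""
    # The answer's example request came back with 503, so treat that as blocked.
    if status_code == 503:
        return True
    # Otherwise fall back to a case-insensitive search for known marker text.
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

With the `response` object from the example above, the call would be `looks_blocked(response.status_code, response.text)`.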
Based on this information, Amazon is trying to prevent people from scraping its site with tools such as Python Requests. I would recommend trying selenium or Amazon's API.
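As a rough starting point for the selenium route, here is a minimal sketch. It assumes selenium 4 or later and a local Chrome installation; `product_url` and `fetch_page_source` are hypothetical helper names, and I have not verified that a headless browser avoids the captcha, so treat it as something to try rather than a guaranteed fix.

```python
def product_url(asin: str) -> str:
    """Build a product URL in the same /dp/<ASIN> form used in the question."""
    return f"https://www.amazon.com/dp/{asin}"


def fetch_page_source(url: str) -> str:
    """Load a page in headless Chrome and return the rendered HTML.

    Assumes the selenium package (version 4+) and Chrome are installed.
    """
    # Imported here so this file still loads where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A call would look like `html = fetch_page_source(product_url("B002G9UDYG"))`, using the product ID from the question.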
Here are some sites that highlight how to use selenium to scrape Amazon: