I've been trying to perform web scraping on an Amazon product to extract the price. I've only used Requests to try to fetch the page data, but I always get the same error:
"To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies."
What can I do?
import requests

header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://www.google.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
}
url = 'https://www.amazon.com/-/es/Revlon-One-Step-Volumizer-PLUS/dp/B096SVJZSW/?_encoding=UTF8&content-id=amzn1.sym.3f4ca281-e55c-46d1-9425-fb252d20366f&ref_=pd_gw_exports_top_sellers_unrec'
response = requests.get(url, headers=header)
data = response.text
print(data)
print(response.status_code)
Amazon sets cookies while you browse the site, which helps it check that you are coming through a real browser (open your browser's developer tools and look under Application > Cookies).
If you call Requests directly, no cookies come back:
#!/usr/bin/env python3
import requests
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'}
url = "https://www.amazon.com/-/es/Revlon-One-Step-Volumizer-PLUS/dp/B096SVJZSW/?_encoding=UTF8&content-id=amzn1.sym.3f4ca281-e55c-46d1-9425-fb252d20366f&ref_=pd_gw_exports_top_sellers_unrec"
r = s.get(url, headers=headers)
print(s.cookies.get_dict())
{}
So Amazon has something in place that stops plain Python Requests calls from working, even if you spoof the User-Agent.
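Since the block page can still arrive as ordinary HTML, it may help to check the body for the phrases from the error message before trying to parse a price. A minimal sketch, assuming the block page keeps the wording quoted in the question; `looks_blocked` and `BLOCK_MARKERS` are hypothetical names of my own, not part of Requests:

```python
# Phrases taken from the block page quoted in the question.
# Assumption: Amazon keeps this wording; adjust the markers if it changes.
BLOCK_MARKERS = (
    "To discuss automated access to Amazon data",
    "not a robot",
)


def looks_blocked(html: str) -> bool:
    """Return True if the HTML contains one of the known block-page phrases."""
    return any(marker in html for marker in BLOCK_MARKERS)


if __name__ == "__main__":
    sample = "To discuss automated access to Amazon data please contact ..."
    print(looks_blocked(sample))
    print(looks_blocked("<span class='a-offscreen'>US$39.97</span>"))
```

You would call it as `looks_blocked(r.text)` right after the `s.get(...)` call and skip parsing when it returns True.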
Options:
1 - Amazon has a pricing API - https://developer-docs.amazon.com/sp-api/docs/product-pricing-api-v0-reference#getpricing - with a maximum rate of 0.5 queries per second on the free usage plan. Even with that limit, it is the recommended route compared to a slow browser.
2 - You can drive a real browser, which is rather slow. But working with Playwright is fun:
#!/usr/bin/env python3
import time
from playwright.sync_api import sync_playwright

url = "https://www.amazon.com/-/es/Revlon-One-Step-Volumizer-PLUS/dp/B096SVJZSW/?_encoding=UTF8&content-id=amzn1.sym.3f4ca281-e55c-46d1-9425-fb252d20366f&ref_=pd_gw_exports_top_sellers_unrec"
with sync_playwright() as p:
    t0 = time.time()
    # Set headless=False to watch the browser window while debugging.
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    # print(page.title())
    # Take the text of the first matching price span.
    price = page.locator("(//span[@class='a-price a-text-price']/span[@class='a-offscreen'])[1]").text_content()
    print(f"{price} in {time.time()-t0:.2f}sec")
    # page.pause()  # keeps the browser open for inspection when running headed
    browser.close()
US$39.97 in 10.26sec
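The locator returns the price as a string such as `US$39.97`, so you will probably want to turn it into a number before comparing or storing it. A rough sketch, assuming US-style formatting (comma as thousands separator, dot as decimal point); `parse_price` is a hypothetical helper of my own, not part of Playwright:

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the first number (e.g. 39.97 or 1,299.00) from a price string.

    Assumption: comma is the thousands separator and dot is the decimal point.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric price found in {text!r}")
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


if __name__ == "__main__":
    print(parse_price("US$39.97"))  # 39.97
```

`Decimal` is used instead of `float` so the cents are kept exactly. Prices shown in other locale formats (for example `39,97`) would need different handling.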