python-requests

Web scraping from Amazon is generating an error


I've been trying to perform web scraping on an Amazon product to extract the price. I've only used Requests to try to fetch the page data, but I always get the same error:

"To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.

Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies."

What can I do?

import requests

header = { 
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-US,en;q=0.9", 
    "Upgrade-Insecure-Requests": "1", 
    "Referer": "https://www.google.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
url = 'https://www.amazon.com/-/es/Revlon-One-Step-Volumizer-PLUS/dp/B096SVJZSW/?_encoding=UTF8&content-id=amzn1.sym.3f4ca281-e55c-46d1-9425-fb252d20366f&ref_=pd_gw_exports_top_sellers_unrec'

response = requests.get(url, headers=header)

data = response.text
print(data)
print(response.status_code)

Solution

  • Amazon sets cookies while you browse the site, to check that requests come from a real browser (open your browser's developer tools and look under Application > Cookies).

    If you use requests directly it will not return any cookies:

    #!/usr/bin/env python3
    import requests
    s = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'}
    url = "https://www.amazon.com/-/es/Revlon-One-Step-Volumizer-PLUS/dp/B096SVJZSW/?_encoding=UTF8&content-id=amzn1.sym.3f4ca281-e55c-46d1-9425-fb252d20366f&ref_=pd_gw_exports_top_sellers_unrec"
    r = s.get(url, headers=headers)
    print(s.cookies.get_dict())
    

    {}

    So Amazon has something in place that stops plain Requests calls from working, even if you spoof the User-Agent.
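
Before parsing a response, it can help to check whether you received the block page instead of the product page. Below is a minimal sketch, assuming the block page still contains the phrases quoted in the question; `BLOCK_MARKERS` and `looks_blocked` are hypothetical names, and the exact wording Amazon serves may differ or change.

```python
# Phrases taken from the block page quoted in the question (an assumption:
# Amazon may change this wording at any time).
BLOCK_MARKERS = (
    "automated access to Amazon data",
    "not a robot",
)


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True if the response looks like the robot check, not a product page."""
    if status_code != 200:
        return True
    return any(marker in html for marker in BLOCK_MARKERS)
```

With the question's code this would be called as `looks_blocked(response.status_code, response.text)`.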

    Options:

    1 - Amazon has a pricing API (https://developer-docs.amazon.com/sp-api/docs/product-pricing-api-v0-reference#getpricing), limited to 0.5 requests per second on the free usage plan. Even with that limit, it is the recommended route compared to driving a slow browser.

    2 - You can drive a real browser. It is rather slow, but Playwright is fun to work with:

    #!/usr/bin/env python3
    import time
    from playwright.sync_api import sync_playwright
    
    url = "https://www.amazon.com/-/es/Revlon-One-Step-Volumizer-PLUS/dp/B096SVJZSW/?_encoding=UTF8&content-id=amzn1.sym.3f4ca281-e55c-46d1-9425-fb252d20366f&ref_=pd_gw_exports_top_sellers_unrec"
    
    with sync_playwright() as p:
        t0 = time.time()
        browser = p.chromium.launch(headless=True)  # set headless=False to watch the browser while debugging
        page = browser.new_page()
        page.goto(url)
        #print(page.title())
        price = page.locator("(//span[@class='a-price a-text-price']/span[@class='a-offscreen'])[1]").text_content()
        print(f"{price} in {time.time()-t0:.2f}sec")
        # page.pause()  # uncomment in headed mode to keep the browser open for debugging
        browser.close()
    

    US$39.97 in 10.26sec
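
The locator returns the price as display text such as `US$39.97`, so you will probably want a number. Here is a small sketch, assuming US-style formatting (comma as thousands separator, dot as decimal point); `parse_price` is a hypothetical helper, not part of Playwright.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the numeric amount from a display string such as 'US$39.97'."""
    # Assumes US-style numbers: optional ',' thousands groups, '.' decimals.
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return Decimal(match.group().replace(",", ""))


print(parse_price("US$39.97"))  # 39.97
```

`Decimal` is used instead of `float` so that money amounts are not subject to binary rounding.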

    https://playwright.dev/python/
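
If you go with option 1, the limit of 0.5 requests per second means leaving at least two seconds between calls. One way to do that is a simple client-side throttle, sketched below. `Throttle` is a hypothetical helper; it knows nothing about the Amazon API and only spaces out whatever you call after `wait()`.

```python
import time


class Throttle:
    """Space out calls so they are at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call; None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be `throttle = Throttle(0.5)` once, then `throttle.wait()` immediately before each API request.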