
Scrape Walmart search results with Python


I am trying to scrape search results on Walmart.

For example, let's go to the URL "https://www.walmart.com/search/?query=coffee%20machine"

And try to extract just the text from the element with the class name search-product-result, all in Python.

I've tried Selenium, but I get asked to verify my identity. I've tried requests, but I get the forbidden page from Walmart. I've tried other libraries and I'm running out of ideas. Any advice?


Solution

  • The search results on this page are loaded by JavaScript, so looking for the rendered product elements with BeautifulSoup will not work in this case.

    However, the data that the page displays is present as a JSON string inside the <script> tag with id=searchContent in the page's HTML.

    I extracted that <script> tag from the HTML, isolated its contents, and parsed them as JSON. You can extract whatever data you need from that JSON.

    Here is the code that prints the product IDs of the search results.

    from bs4 import BeautifulSoup
    import requests
    import json
    
    # Send a browser-like User-Agent header with the request.
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    url = 'https://www.walmart.com/search?query=coffee%20machine'
    
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    
    # The search data is embedded as JSON in <script id="searchContent">.
    tag = soup.find('script', {'id': 'searchContent'})
    if tag is None:
        raise SystemExit('No <script id="searchContent"> tag found in the response')
    
    # tag.string is the raw JSON text inside the tag.
    j = json.loads(tag.string)
    items = j['searchContent']['preso']['items']
    
    for item in items:
        print(item['productId'])
    

    This prints the product IDs:

    2RYLQXVZ80E8
    7EYUEQ82RMBP
    7A3VDQNS5R36
    22GRP3PGSY4A
    238DLP3R0M3W
    52NMIX2M8SC5
    1R4H630LRNSE
    .
    .
    .
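The extraction step described above can be tried offline, without sending a request to Walmart. The following is only a minimal sketch under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependencies, and it runs on a small made-up HTML sample. The names ScriptTextParser, load_search_items, list_item_fields and SAMPLE_HTML are hypothetical, as are the sample product IDs and the "title" key. Only the searchContent id and the searchContent/preso/items/productId path come from the answer.

```python
import json
from html.parser import HTMLParser


class ScriptTextParser(HTMLParser):
    """Collect the text inside the <script> tag that has the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.inside = False

    def handle_data(self, data):
        # Script contents may arrive in several pieces, so collect them all.
        if self.inside:
            self.chunks.append(data)


def load_search_items(html, script_id="searchContent"):
    """Return the list of result items, or [] if the script tag is absent."""
    parser = ScriptTextParser(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    if not text:
        return []
    payload = json.loads(text)
    # Key path taken from the answer; .get() avoids a KeyError if it changes.
    return payload.get("searchContent", {}).get("preso", {}).get("items", [])


def list_item_fields(items):
    """Return the sorted set of keys seen across all items."""
    fields = set()
    for item in items:
        fields.update(item.keys())
    return sorted(fields)


# Made-up sample page; the IDs and the "title" key are hypothetical.
SAMPLE_HTML = """
<html><body>
<script id="searchContent" type="application/json">
{"searchContent": {"preso": {"items": [
    {"productId": "AAA111", "title": "Sample item one"},
    {"productId": "BBB222", "title": "Sample item two"}
]}}}
</script>
</body></html>
"""

items = load_search_items(SAMPLE_HTML)
print([item["productId"] for item in items])
print(list_item_fields(items))
```

Listing the keys that appear on the items is one way to find out which other fields the JSON offers before deciding what to extract. The real page's structure may differ from this sample or change over time, so the key path should be checked against an actual response.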