I'm trying to scrape search results on Walmart.
For example, let's go to the URL "https://www.walmart.com/search/?query=coffee%20machine" and try to extract just the text from the element with the class name search-product-result, all in Python.
I've tried selenium and I get asked to verify my identity. I've tried requests and I get the forbidden page from Walmart. I've tried other libraries and I'm running out of ideas. Any advice?
The data on this page is loaded by JavaScript, so looking up the rendered product elements in the HTML with BeautifulSoup will not work in this case.
However, the data that the page displays is present as a JSON string inside a <script> tag with id=searchContent in its HTML code.
I extracted that <script> tag from the HTML code, took the text inside it, and parsed it as JSON. You can extract whatever data you need from that JSON.
Here is the code that prints the product IDs of the search results.
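from bs4 import BeautifulSoup
import requests
import json

# Send a browser-like User-Agent header with the request.
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
url = 'https://www.walmart.com/search?query=coffee%20machine'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# The search results are embedded as JSON in <script id="searchContent">.
tag = soup.find('script', {'id': 'searchContent'})
if tag is None:
    raise SystemExit('searchContent script not found (the request may have been blocked)')

# Read the tag's text directly. str.strip() removes a set of characters,
# not a prefix or suffix, so it is not a safe way to peel off the tag markup.
j = json.loads(tag.string)

# Each entry in 'items' describes one product in the search results.
for item in j['searchContent']['preso']['items']:
    print(item['productId'])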
This outputs the product IDs:
2RYLQXVZ80E8
7EYUEQ82RMBP
7A3VDQNS5R36
22GRP3PGSY4A
238DLP3R0M3W
52NMIX2M8SC5
1R4H630LRNSE
.
.
.
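If the request gets blocked (the forbidden page or identity check mentioned in the question), the searchContent tag will presumably be missing from the response, so it may be worth making the extraction step tolerant of that. Below is a minimal sketch that uses only the standard library and can be tried offline. It assumes the JSON layout shown above (searchContent -> preso -> items -> productId); the names ScriptJSONExtractor, extract_product_ids and SAMPLE_HTML are hypothetical, and the sample HTML and its IDs are made up for illustration.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text of the first <script> tag whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False      # True once the matching tag has been seen
        self._inside = False    # True while we are inside the matching tag
        self.chunks = []        # pieces of text found inside the tag

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_product_ids(html, script_id="searchContent"):
    """Return the product IDs embedded in the page, or [] if the tag is absent."""
    parser = ScriptJSONExtractor(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    if not parser.found or not text:
        return []  # e.g. a blocked response that contains no search data
    payload = json.loads(text)
    # Assumed layout, taken from the code above: searchContent -> preso -> items.
    items = payload.get("searchContent", {}).get("preso", {}).get("items", [])
    return [item["productId"] for item in items if "productId" in item]


# Made-up page fragment with placeholder IDs, for illustration only.
SAMPLE_HTML = (
    '<html><body><script id="searchContent" type="application/json">'
    '{"searchContent": {"preso": {"items": '
    '[{"productId": "AAA111"}, {"productId": "BBB222"}]}}}'
    '</script></body></html>'
)

print(extract_product_ids(SAMPLE_HTML))
```

With a real response you would pass r.text to extract_product_ids; an empty list then suggests the request was blocked or the page layout has changed.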