I'm trying to scrape product specifications from an e-commerce website. I have a list of URLs for various products, and my code needs to visit each one (this part is easy) and extract the product specs I need. I have been trying to use ParseHub; it works for some links but not for others. My suspicion is that a spec such as 'Wheel Diameter' changes its position from page to page, so ParseHub ends up grabbing the wrong spec value.
One such part of the HTML looks like this:
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
What I think I could do is use BeautifulSoup with something like:
if soup.find("span", class_ = "product-detail-key").text.strip()=="Wheel Diameter":
*go to the next line and grab the string inside*
How can I code this? I apologize if my question sounds silly; I'm pretty new to web scraping.
You can use the .find_next() function:
from bs4 import BeautifulSoup
html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
diameter = soup.find("span", string="Wheel Diameter").find_next("span").text
print(diameter)
Prints:
8 Inches
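If some product pages do not list the spec at all, the one-liner above raises an AttributeError, because .find() returns None when nothing matches. A minimal sketch of a safer lookup follows, close to the loop-and-compare idea from the question. The helper name get_spec and the "Weight" key are assumptions made for illustration only:

```python
from bs4 import BeautifulSoup


def get_spec(soup, key):
    """Return the value shown next to the label `key`, or None if it is absent.

    `get_spec` is a hypothetical helper name, not part of BeautifulSoup.
    """
    # Check every label span, wherever it appears on the page.
    for label in soup.find_all("span", class_="product-detail-key"):
        if label.get_text(strip=True) == key:
            # The value is assumed to be the next <span> sibling of the label.
            value = label.find_next_sibling("span")
            return value.get_text(strip=True) if value is not None else None
    return None


html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(get_spec(soup, "Wheel Diameter"))
print(get_spec(soup, "Weight"))  # "Weight" is not in this snippet
```

Comparing the stripped text also tolerates stray whitespace around the label, which an exact string match would not.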
Or using a CSS selector with the + combinator:
diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + *').text
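Since the position of a spec varies between pages, it may be simpler to collect every key/value pair into a dictionary and look specs up by name. The following is only a sketch under the assumption that every spec row follows the same markup as the snippet in the question; the function name extract_specs and the second "Weight" row are made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_specs(soup):
    """Map each spec label to its value, regardless of row order.

    `extract_specs` is a hypothetical helper; it assumes each spec lives in a
    <div class="product-detail"> with a key span and a value span.
    """
    specs = {}
    for row in soup.select("div.product-detail"):
        label = row.select_one("span.product-detail-key")
        value = row.select_one("span[data-product-custom-field]")
        # Skip rows that do not contain both parts.
        if label is not None and value is not None:
            specs[label.get_text(strip=True)] = value.get_text(strip=True)
    return specs


# The "Weight" row is invented sample data to show that order does not matter.
html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Weight</span>
<span data-product-custom-field="">12 lbs</span>
</div>
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""

specs = extract_specs(BeautifulSoup(html_doc, "html.parser"))
print(specs.get("Wheel Diameter"))
```

Using specs.get(...) returns None for a spec that a given page does not list, so the same code can run over the whole list of URLs.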