I'm learning web scraping with Python, and as a learning project I'm trying to extract all the products and their prices from a supermarket website.
This supermarket has more than 100 categories of products. This is the page of one category: https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion
As you can see, some products have discount prices, and these are not present when the page first loads; they are loaded dynamically afterwards.
I could handle that by using Selenium and a WebDriver with a waiting time of a couple of seconds, like this:
import json
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

def getHtmlDynamic(url, time_wait):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(time_wait)  # give the dynamic content time to load
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    driver.quit()
    return soup
def getProductsAndPrices(html):
    # Product metadata is embedded in a JSON-LD <script> tag
    prodsJson = html.find_all('script', {'type': 'application/ld+json'})
    dfProds = pd.json_normalize(json.loads(prodsJson[1].contents[0])['itemListElement'])
    # Prices live in separate containers that are loaded dynamically
    pricesList = html.find_all('div', {'class': 'contenedor-precio'})
    prices = []
    for row in pricesList:
        price_row = row.find_all('span')
        for price in price_row:
            priceFinal = price.text
            prices.append(priceFinal)
    pricesFinalList = prices[:dfProds.shape[0]]
    dfProds['price'] = pricesFinalList
    return dfProds
htmlProducts = getHtmlDynamic(url='https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion', time_wait=20)
dfProds = getProductsAndPrices(htmlProducts)
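As a sketch of an alternative to the fixed sleep, the wait could be made explicit; this assumes the contenedor-precio elements are what appear once the prices finish loading (the function name here is just illustrative):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def getHtmlDynamicExplicit(url, timeout=20):
    # Sketch: wait until at least one price container is present instead of
    # always sleeping a fixed number of seconds. The assumption that
    # 'contenedor-precio' signals "prices loaded" is mine, not verified.
    driver = webdriver.Chrome()
    driver.get(url)
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'contenedor-precio'))
    )
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    driver.quit()
    return soup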
This works well for a single category, but when I tried to scale it to more categories (10, for example) with a for loop, it crashes: the dynamic content is no longer loaded correctly after the second iteration.
dfProductsConsolidated = pd.DataFrame([])
for category in dfCategories['categoryURL'][:10]:
    htmlProducts = getHtmlDynamic(url=category, time_wait=20)
    dfProds = getProductsAndPrices(htmlProducts)
    dfProductsConsolidated = pd.concat([dfProductsConsolidated, dfProds])
Is there any way to handle this kind of scraping at a larger scale? Any best practices that could help me with this?
Thanks in advance!
To speed up the loading of pages I suggest starting the driver in headless mode and with images disabled.
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')                        # no browser window
options.add_argument('--blink-settings=imagesEnabled=false')  # skip image downloads
driver = webdriver.Chrome(options=options)
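For example, the question's getHtmlDynamic could create its driver the same way (a sketch of the modified function):

def getHtmlDynamic(url, time_wait):
    # As in the question, but with a headless Chrome that skips
    # image downloads to speed up page loads.
    options = webdriver.ChromeOptions()
    options.add_argument('--headless=new')
    options.add_argument('--blink-settings=imagesEnabled=false')
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(time_wait)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    driver.quit()
    return soup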
The following code scrapes the data for all the products in the 10 categories. It clicks the "Mostrar más" (show more) button whenever it is present, so that all the products get loaded. The execution took about 14 minutes on my computer and did not crash; it was slow mainly because the category "Almacen/Desayuno-y-Merienda" contains 808 products.
Data (items and prices) are stored in a dictionary, and each category has a separate dictionary. All the dictionaries are stored in another dictionary called data.
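For illustration, the final structure looks like this (the keys come from the category URLs; the values shown here are placeholders, not real scraped data):

# Illustrative shape of the final `data` dictionary (placeholder values):
# data = {
#     'Electro/aire-acondicionado-y-ventilacion': {
#         'item':  ['<product name>', ...],
#         'price': ['<price>', ...],
#     },
#     'Almacen/Aceites-y-Vinagres': {...},
#     ...
# }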
import json
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import ElementClickInterceptedException, StaleElementReferenceException
urls = '''https://www.vea.com.ar/Electro/aire-acondicionado-y-ventilacion
https://www.vea.com.ar/Almacen/Aceites-y-Vinagres
https://www.vea.com.ar/Almacen/Desayuno-y-Merienda
https://www.vea.com.ar/Lacteos/Leches
https://www.vea.com.ar/Frutas-y-Verduras/Frutas
https://www.vea.com.ar/Bebes-y-Ninos/Jugueteria
https://www.vea.com.ar/Quesos-y-Fiambres/Fiambres
https://www.vea.com.ar/Panaderia-y-Reposteria/Panaderia
https://www.vea.com.ar/Mascotas/Perros
https://www.vea.com.ar/Bebidas/Gaseosas'''.split('\n')
categories = [url.split('.ar/')[-1] for url in urls]
data = {key: {} for key in categories}

for idx, category in enumerate(categories):
    info = f'[{idx+1}/{len(categories)}] {category} '
    print(info, end='')
    driver.get('https://www.vea.com.ar/' + category)

    # Wait until the footer "Mostrando X de Y productos" is visible,
    # then read the total number of products from it
    number_of_products = 0
    while number_of_products == 0:
        footer = WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, 'p.text-content')))
        number_of_products = int(footer.text.split()[3])

    number_of_loaded_products = int(footer.text.split()[1])
    print(f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')

    # Keep clicking "Mostrar más" until all the products are loaded
    while number_of_loaded_products < number_of_products:
        footer = driver.find_element(By.CSS_SELECTOR, 'p.text-content')
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', footer)
        show_more = driver.find_elements(By.XPATH, "//div[text()='Mostrar más']")
        if show_more:
            try:
                show_more[0].click()
            except (ElementClickInterceptedException, StaleElementReferenceException):
                continue
        number_of_loaded_products = int(footer.text.split()[1])
        print(info + f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')
        time.sleep(1)

    # All the products are now in the JSON-LD <script> tag: parse it
    loaded_products = json.loads(driver.find_element(
        By.CSS_SELECTOR, "body script[type='application/ld+json']"
    ).get_attribute('innerText'))['itemListElement']

    products = {'item': [], 'price': []}
    for prod in loaded_products:
        products['item'] += [prod['item']['name']]
        products['price'] += [prod['item']['offers']['offers'][0]['price']]
    data[category] = products
    print()
The code prints progress information while looping, and at the end you get something like this:
[1/10] Electro/aire-acondicionado-y-ventilacion (loaded products=7, total=7)
[2/10] Almacen/Aceites-y-Vinagres (loaded products=87, total=87)
[3/10] Almacen/Desayuno-y-Merienda (loaded products=808, total=808)
[4/10] Lacteos/Leches (loaded products=80, total=80)
[5/10] Frutas-y-Verduras/Frutas (loaded products=70, total=70)
[6/10] Bebes-y-Ninos/Jugueteria (loaded products=57, total=57)
[7/10] Quesos-y-Fiambres/Fiambres (loaded products=19, total=19)
[8/10] Panaderia-y-Reposteria/Panaderia (loaded products=17, total=17)
[9/10] Mascotas/Perros (loaded products=66, total=66)
[10/10] Bebidas/Gaseosas (loaded products=64, total=64)
To visualize the scraped data you can run pd.DataFrame(data[categories[idx]]), where idx is an integer from 0 to len(categories)-1. For example, for idx=1 you get the items and prices of Almacen/Aceites-y-Vinagres.
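If you prefer a single consolidated DataFrame, as in the question, the per-category dictionaries can be concatenated (a sketch, assuming the data dictionary built above):

import pandas as pd

# Combine every category into one DataFrame, tagging each row
# with the category it came from.
dfAll = pd.concat(
    [pd.DataFrame(data[cat]).assign(category=cat) for cat in categories],
    ignore_index=True,
)
print(dfAll)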