Tags: python, selenium-webdriver, web-scraping, beautifulsoup, dynamic-content

How to scrape a website that has dynamic content across multiple pages or categories using Python


I'm learning web scraping with Python and as a learning project I'm trying to extract all the products and their prices from a supermarket website.

This supermarket has more than 100 categories of products. This is the page of one category:

https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion

As you can see, some products have discounted prices, and those prices are not present in the initial page load; they are loaded dynamically afterwards.

I was able to handle that by using Selenium with a WebDriver and a fixed wait of a couple of seconds, like this:

import json
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def getHtmlDynamic(url, time_wait):
    # Load the page, wait for the dynamic content, then parse the rendered HTML
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(time_wait)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    driver.quit()

    return soup

def getProductsAndPrices(html):
    # The product metadata is embedded in a JSON-LD <script> tag
    prodsJson = html.find_all('script', {'type': 'application/ld+json'})
    dfProds = pd.json_normalize(json.loads(prodsJson[1].contents[0])['itemListElement'])

    # Prices are rendered as <span> tags inside the price containers
    pricesList = html.find_all('div', {'class': 'contenedor-precio'})
    prices = []

    for row in pricesList:
        price_row = row.find_all('span')
        for price in price_row:
            priceFinal = price.text
            prices.append(priceFinal)

    # Keep only as many prices as there are products
    pricesFinalList = prices[:dfProds.shape[0]]

    dfProds['price'] = pricesFinalList

    return dfProds

htmlProducts = getHtmlDynamic(url='https://www.vea.com.ar/electro/aire-acondicionado-y-ventilacion', time_wait=20)

dfProds = getProductsAndPrices(htmlProducts)

This works well for one specific category, but when I tried to scale it to more categories (10, for example) with a for loop, it crashes: the dynamic content is no longer loaded correctly after the second iteration.

dfProductsConsolidated = pd.DataFrame([])

# dfCategories holds the category URLs (built elsewhere)
for category in dfCategories['categoryURL'][:10]:
    htmlProducts = getHtmlDynamic(url=category, time_wait=20)

    dfProds = getProductsAndPrices(htmlProducts)

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    dfProductsConsolidated = pd.concat([dfProductsConsolidated, dfProds], ignore_index=True)

Is there any way to handle this kind of scraping at a larger scale? Are there any best practices that can help me with this?

Thanks in advance!


Solution

  • To speed up the loading of pages, I suggest starting the driver in headless mode with images disabled.

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument('--blink-settings=imagesEnabled=false')
    
    driver = webdriver.Chrome(options=options)
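
    As a side note, the fixed time.sleep in the question can usually be replaced with an explicit wait, which returns as soon as the content is available and raises a TimeoutException otherwise. A minimal sketch (the helper name is arbitrary), assuming the div.contenedor-precio price container from the question is a reliable marker that the dynamic content has loaded:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def get_rendered_html(driver, url, timeout=20):
        # Wait until at least one price container is present instead of
        # sleeping a fixed number of seconds on every page
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'div.contenedor-precio'))
        )
        return driver.page_source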
    

    The following code scrapes the data for all the products in the 10 categories. It clicks the "Mostrar más" ("show more") button whenever it is present, so that all the products get loaded. The execution took about 14 minutes on my computer and did not crash; most of that time was spent on the category Almacen/Desayuno-y-Merienda, which contains more than 800 products.

    Data (items and prices) are stored in a dictionary, with a separate dictionary for each category. All the per-category dictionaries are stored in another dictionary called data.
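
    For illustration, data ends up with this shape (placeholder values, not real scraped output):

    data = {
        'Lacteos/Leches': {
            'item':  ['<product name>', '<product name>'],
            'price': ['<price>', '<price>'],
        },
        # ... one inner dictionary per category
    }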

    import json
    import time
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import ElementClickInterceptedException, StaleElementReferenceException
    
    urls = '''https://www.vea.com.ar/Electro/aire-acondicionado-y-ventilacion
    https://www.vea.com.ar/Almacen/Aceites-y-Vinagres
    https://www.vea.com.ar/Almacen/Desayuno-y-Merienda
    https://www.vea.com.ar/Lacteos/Leches
    https://www.vea.com.ar/Frutas-y-Verduras/Frutas
    https://www.vea.com.ar/Bebes-y-Ninos/Jugueteria
    https://www.vea.com.ar/Quesos-y-Fiambres/Fiambres
    https://www.vea.com.ar/Panaderia-y-Reposteria/Panaderia
    https://www.vea.com.ar/Mascotas/Perros
    https://www.vea.com.ar/Bebidas/Gaseosas'''.split('\n')
    
    categories = [url.split('.ar/')[-1] for url in urls]
    data = {key:{} for key in categories}
    
    for idx, category in enumerate(categories):
        info = f'[{idx+1}/{len(categories)}] {category} '
        print(info, end='')
        driver.get('https://www.vea.com.ar/' + category)

        # The footer reads like 'Mostrando <loaded> de <total> productos',
        # so word 1 is the loaded count and word 3 the total
        number_of_products = 0
        while number_of_products == 0:
            footer = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'p.text-content')))
            number_of_products = int(footer.text.split()[3])
            number_of_loaded_products = int(footer.text.split()[1])
        print(f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')
        
        # Keep clicking "Mostrar más" until every product is loaded
        while number_of_loaded_products < number_of_products:
            footer = driver.find_element(By.CSS_SELECTOR, 'p.text-content')
            # Scroll the footer into view so the button is clickable
            driver.execute_script('arguments[0].scrollIntoView({block: "center"});', footer)
            show_more = driver.find_elements(By.XPATH, "//div[text()='Mostrar más']")
            if show_more:
                try:
                    show_more[0].click()
                except (ElementClickInterceptedException, StaleElementReferenceException):
                    # The button was covered or went stale; retry on the next pass
                    continue
            number_of_loaded_products = int(footer.text.split()[1])
            print(info + f'(loaded products={number_of_loaded_products}, total={number_of_products})', end='\r')
            time.sleep(1)
    
        # The full product list is embedded in the page as JSON-LD
        loaded_products = json.loads(driver.find_element(By.CSS_SELECTOR, "body script[type='application/ld+json']").get_attribute('innerText'))['itemListElement']
        products = {'item': [], 'price': []}
        for prod in loaded_products:
            products['item']  += [prod['item']['name']]
            products['price'] += [prod['item']['offers']['offers'][0]['price']]

        data[category] = products
        print()
    

    The code prints progress information while looping; in the end the output looks like this:

    [1/10] Electro/aire-acondicionado-y-ventilacion (loaded products=7, total=7)
    [2/10] Almacen/Aceites-y-Vinagres (loaded products=87, total=87)
    [3/10] Almacen/Desayuno-y-Merienda (loaded products=808, total=808)
    [4/10] Lacteos/Leches (loaded products=80, total=80)
    [5/10] Frutas-y-Verduras/Frutas (loaded products=70, total=70)
    [6/10] Bebes-y-Ninos/Jugueteria (loaded products=57, total=57)
    [7/10] Quesos-y-Fiambres/Fiambres (loaded products=19, total=19)
    [8/10] Panaderia-y-Reposteria/Panaderia (loaded products=17, total=17)
    [9/10] Mascotas/Perros (loaded products=66, total=66)
    [10/10] Bebidas/Gaseosas (loaded products=64, total=64)
    

    To inspect the scraped data you can run pd.DataFrame(data[categories[idx]]), where idx is an integer from 0 to len(categories)-1. For example, idx=1 gives the products of Aceites-y-Vinagres.

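    If you also want the single consolidated DataFrame the question was building, a minimal sketch that combines the per-category dictionaries (pd.concat replaces the removed DataFrame.append):

    import pandas as pd

    # One row per product, with a 'category' column recording its origin
    dfProductsConsolidated = pd.concat(
        [pd.DataFrame(products).assign(category=cat) for cat, products in data.items()],
        ignore_index=True,
    )
    print(dfProductsConsolidated.head())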