Tags: python, selenium-webdriver, web-scraping, dynamic-tables, staleelementreferenceexception

Issues with Selenium/Python When Scraping Data from a Dynamic Table: Staleness Exception with Expected Conditions and Performance Decrease


I'm a beginner and have a lot to learn, so please be patient with me.

Using Python and Selenium, I'm trying to scrape table data from a website while navigating through its pages. As I navigate from page to page, the table shows the updated data, but the page doesn't reload and the URL stays the same.

To get the refreshed data from the table and avoid a stale element exception, I used WebDriverWait with expected_conditions on the tr elements. Even with the wait, my code didn't get the refreshed data: it kept reading the old data from the previous page and then threw the exception. So I added time.sleep() after clicking the next-page button, which solved the problem.

However, I noticed my code was getting slower the more pages I navigated, and at around page 120 it threw the stale element exception and could no longer get the refreshed data. I'm assuming this is because the for loop inside the while loop slows down performance.

I tried an implicit wait and gradually increased the time.sleep() to avoid the staleness exception, but nothing worked. There are 100 table rows on each page and around 3,100 pages in total.

My questions are the following:

  1. Why do I get the stale element exception, and how can I avoid it?
  2. How can I increase the efficiency of the code?

I searched a lot and really tried to fix this on my own before deciding to post here, but I'm stuck and don't know what to do. Please help, and thank you so much for your time.

from datetime import datetime
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# setup assumed by the loop below (example values; driver is an already-created webdriver on the dashboard page)
count = 0
page = 1
wait = 2  # base number of seconds to sleep after each click
exclusion_request_dict = {"ID": [], "Company": [], "Product": [],
                          "HTSUSCode": [], "Status": [], "Posted Date": []}

while True:

    # wait until the table rows are visible when the page is loaded;
    # this step is a must for Selenium to scrape data from the dynamic table when navigating through pages
    tr = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='erdashboard']/tbody/tr")))
    
    for record in tr:
        count += 1
    
        posted_date = datetime.strptime(record.find_element(By.XPATH, './td[7]').text, "%m/%d/%Y").date()
    
        exclusion_request_dict["ID"].append(int(record.find_element(By.XPATH, './td[1]').text))
        exclusion_request_dict["Company"].append(record.find_element(By.XPATH, './td[2]').text)
        exclusion_request_dict["Product"].append(record.find_element(By.XPATH, './td[3]').text)
        exclusion_request_dict["HTSUSCode"].append(record.find_element(By.XPATH, './td[4]').text)
        exclusion_request_dict["Status"].append(record.find_element(By.XPATH, './td[5]').text)
        exclusion_request_dict["Posted Date"].append(posted_date)
    
    next_button = driver.find_element(By.ID, "erdashboard_next")
    next_button_clickable = next_button.get_attribute("class").split(" ")
    print(next_button_clickable)
    print("Current Page:", page, "Total Counts:", count)
    
    if next_button_clickable[-1] == "disabled":
        break
    
    next_button.click() # goes to the next page
    page += 1
    time.sleep(wait + 0.01)

Solution

    1. When you click the next page button, you can avoid the stale element exception by, for example, waiting until the ID in the first row has changed. This is done in the section of the code marked # wait until new page is loaded (see the full code below, and the compact variant after it).

    2. When scraping data from a table, you can increase the efficiency of the code with two tricks. First, loop over columns rather than over rows, because there are (almost always) more rows than columns. Second, use javascript instead of the Selenium .text attribute, because js is much faster than .text. For example, to scrape the values in the first column, the Selenium command is

      [td.text for td in driver.find_elements(By.XPATH, '//tbody/tr/td[1]')]
      

      and it takes about 1.2 seconds on my computer, while the corresponding javascript command (see the code inside for idx in range(1,8) below) takes only about 0.008 seconds (150 times faster!). The first trick makes only a slight difference when using .text, but it is really effective when using javascript: for example, scraping the whole table by rows with js takes about 0.52 seconds, while by columns it takes only about 0.05 seconds. You can reproduce the comparison with the sketch below.
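
      If you want to reproduce the comparison on your machine, here is a minimal timing sketch (it assumes driver is already showing a page of the table and By is imported, as in the question; the absolute numbers will of course vary with your hardware):

        import time

        # 1) selenium .text, first column only
        start = time.time()
        col = [td.text for td in driver.find_elements(By.XPATH, '//tbody/tr/td[1]')]
        print(f'.text, one column: {time.time() - start:.3f}s')

        # 2) javascript, whole table row by row
        start = time.time()
        rows = driver.execute_script(
            "return [...document.querySelectorAll('tbody>tr')]"
            ".map(tr => [...tr.querySelectorAll('td')].map(td => td.innerText));")
        print(f'js, by rows: {time.time() - start:.3f}s')

        # 3) javascript, whole table column by column
        start = time.time()
        cols = [driver.execute_script(
            f"return [...document.querySelectorAll('tbody>tr>td:nth-child({idx})')]"
            ".map(td => td.innerText);") for idx in range(1, 8)]
        print(f'js, by columns: {time.time() - start:.3f}s')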

    Here is the full code:

    import math, time, pandas
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import StaleElementReferenceException
    
    chromedriver_path = '...'
    driver = webdriver.Chrome(service=Service(chromedriver_path))
    wait = WebDriverWait(driver,9)
    driver.get('https://232app.azurewebsites.net/')
    
    dropdown = wait.until(EC.element_to_be_clickable((By.NAME, 'erdashboard_length')))
    target_number_of_rows = 100
    Select(dropdown).select_by_value(str(target_number_of_rows))
    
    # wait until all 100 rows are loaded (polling; the short sleep keeps this from busy-looping)
    current_number_of_rows = 0
    while current_number_of_rows != target_number_of_rows:
        current_number_of_rows = len(driver.find_elements(By.CSS_SELECTOR, 'tbody tr'))
        time.sleep(0.1)
    
    # column headers; the last header cell is skipped because its column is not scraped below
    header = [th.text for th in driver.find_elements(By.XPATH, '//tr/th[position()<last()]')]
    data = {key:[] for key in header}
    # the label of the last pagination button gives the total number of pages
    number_of_pages = int(driver.find_element(By.CSS_SELECTOR, '.paginate_button:last-child').text)
    times = []
    
    while True:
        
        start = time.time()

        # after the first page, print progress with an ETA based on the mean time per page
        if len(times) > 0:
            current_page = int(driver.find_element(By.CLASS_NAME, "current").text)
            mean = sum(times) / len(times)
            eta = (number_of_pages - current_page) * mean
            minutes = math.floor(eta / 60)
            seconds = round(eta % 60)
            print(f'current page {current_page} (ETA {minutes}:{seconds:02d}) (mean per page {mean:.2f}s) ({len(data[header[0]])} rows scraped)', end='\r')
        
        # scrape the table column by column: one javascript call per column returns
        # the innerText of every cell in that column (much faster than element.text)
        for idx in range(1,8):
            data[header[idx-1]] += driver.execute_script(
                "var result = [];" +
                f"var all = document.querySelectorAll('tbody>tr>td:nth-child({idx})');" +
                "for (var i=0, max=all.length; i < max; i++) {" +
                "    result.push(all[i].innerText);" +
                "} " +
                "return result;")
        
        # sanity check: all lists in the dictionary must have the same length; if not, a column is missing or was not scraped properly
        lens = [len(data[h]) for h in header]
        if len(set(lens)) != 1:
            print('\nerror: lists in the dictionary have different lengths')
            print(lens)
            break
    
        # remember the ID in the first row, then click the next page button if available
        first_row_id_old = WebDriverWait(driver,9).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
        next_btn = driver.find_element(By.ID, 'erdashboard_next')
        if 'disabled' not in next_btn.get_attribute('class'):
            next_btn.click()
        else:
            print('\nno more pages to load')
            break

        # wait until new page is loaded, i.e. until the ID in the first row differs from the pre-click one
        first_row_id_new = first_row_id_old
        while first_row_id_new == first_row_id_old:
            try:
                first_row_id_new = WebDriverWait(driver,9).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
            except StaleElementReferenceException:
                # the element reference went stale while reading: just retry
                continue
        times += [time.time() - start]
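
    As a side note, the "wait until the ID in the first row has changed" idea can also be written as a custom wait condition instead of the manual loop. A sketch, equivalent to the block above (it reuses next_btn and the imports from the full code):

    old_id = driver.find_element(By.CSS_SELECTOR, 'tbody>tr>td').text
    next_btn.click()
    # poll until the first cell's text differs from the pre-click value,
    # transparently retrying if the element goes stale mid-read
    WebDriverWait(driver, 9, ignored_exceptions=[StaleElementReferenceException]).until(
        lambda d: d.find_element(By.CSS_SELECTOR, 'tbody>tr>td').text != old_id)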
    

    While the loop is running you get output like this ("ETA" is the estimated remaining time in the format minutes:seconds; "mean per page" is the mean time it takes to execute one loop iteration):

    current page 156 (ETA 73:58) (mean per page 1.52s) (15500 rows scraped)
    

    Then by running pandas.DataFrame(data) you get the whole scraped table as a DataFrame, with one column per table header and one row per scraped table row.

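    For example, a minimal sketch to turn the result into a DataFrame and save it (the CSV file name is just an illustration):

    df = pandas.DataFrame(data)
    print(df.head())  # first five scraped rows
    df.to_csv('exclusion_requests.csv', index=False)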