I'm a beginner and have a lot to learn, so please be patient with me.
Using Python and Selenium, I'm trying to scrape table data from a website while navigating through different pages. As I navigate through the pages, the table shows the updated data, but the page itself doesn't refresh and the URL remains the same.
To get the refreshed data from the table and avoid a stale element exception, I used WebDriverWait and expected_conditions (on the tr elements). Even with the wait, my code didn't get the refreshed data: it kept reading the old data from the previous page and threw the exception. So I added time.sleep() after clicking the next page button, which solved the problem at first.
However, I noticed my code was getting slower the more pages I navigated, and at around page 120 it threw the stale element exception again and could not get the refreshed data. I'm assuming this is because the for loop nested inside the while loop slows down performance.
I tried an implicit wait and gradually increased time.sleep() to avoid the staleness exception, but nothing worked. There are 100 table rows on each page and around 3,100 pages in total.
In short, the problems are: the code keeps getting slower, and even with the waits it eventually throws a stale element exception.
I searched a lot and really tried to fix this on my own before deciding to write here. I'm stuck and don't know what to do. Please help, and thank you so much for your time. Here is my code:
import time
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# setup (placeholder values)
driver = webdriver.Chrome()
driver.get('https://232app.azurewebsites.net/')
wait = 1  # base number of seconds to sleep after clicking the next page button
count = 0
page = 1
exclusion_request_dict = {"ID": [], "Company": [], "Product": [],
                          "HTSUSCode": [], "Status": [], "Posted Date": []}

while True:
    # waits until the table elements are visible when the page is loaded
    # this is a must step for Selenium to scrape data from the dynamic table
    # when we navigate through different pages
    tr = WebDriverWait(driver, 10).until(
        EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='erdashboard']/tbody/tr")))
    for record in tr:
        count += 1
        posted_date = datetime.strptime(record.find_element(By.XPATH, './td[7]').text, "%m/%d/%Y").date()
        exclusion_request_dict["ID"].append(int(record.find_element(By.XPATH, './td[1]').text))
        exclusion_request_dict["Company"].append(record.find_element(By.XPATH, './td[2]').text)
        exclusion_request_dict["Product"].append(record.find_element(By.XPATH, './td[3]').text)
        exclusion_request_dict["HTSUSCode"].append(record.find_element(By.XPATH, './td[4]').text)
        exclusion_request_dict["Status"].append(record.find_element(By.XPATH, './td[5]').text)
        exclusion_request_dict["Posted Date"].append(posted_date)
    next_button = driver.find_element(By.ID, "erdashboard_next")
    next_button_clickable = next_button.get_attribute("class").split(" ")
    print(next_button_clickable)
    print("Current Page:", page, "Total Counts:", count)
    if next_button_clickable[-1] == "disabled":
        break
    next_button.click()  # goes to the next page
    page += 1
    time.sleep(wait + 0.01)
When you click the next page button, you can avoid the stale element exception by, for example, waiting until the ID in the first row has changed. This is done in the section of the full code below marked # wait until new page is loaded.
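In isolation, the idea looks like this (a minimal sketch, assuming driver is already on the dashboard page; the variable names are just illustrative):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException

# remember the ID shown in the first cell, then click next
old_id = driver.find_element(By.CSS_SELECTOR, 'tbody>tr>td').text
driver.find_element(By.ID, 'erdashboard_next').click()

# keep re-reading the first cell until its value differs from the old one
new_id = old_id
while new_id == old_id:
    try:
        new_id = WebDriverWait(driver, 9).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
    except StaleElementReferenceException:
        continue  # the old element went stale mid-read; just retry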
When scraping data from a table, you can make the code more efficient with two tricks. First, loop over columns rather than over rows, because there are (almost always) more rows than columns. Second, use javascript instead of the Selenium .text command, because js is way faster. For example, to scrape the values in the first column, the Selenium command is [td.text for td in driver.find_elements(By.XPATH, '//tbody/tr/td[1]')] and it takes about 1.2 seconds on my computer, while the corresponding javascript command (see the code inside for idx in range(1, 8) below) takes only about 0.008 seconds (150 times faster!). The first trick makes only a slight difference when using .text, but it really pays off with javascript: scraping the whole table by rows with js takes about 0.52 seconds, while by columns it takes only about 0.05 seconds.
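If you want to time the two approaches yourself, here is a small sketch of the comparison (the js call is a compact equivalent of the loop used in the full code below, and the printed timings will of course vary by machine):

import time

# Selenium .text: one WebDriver round-trip per cell
start = time.time()
col_selenium = [td.text for td in driver.find_elements(By.XPATH, '//tbody/tr/td[1]')]
print(f'.text time: {time.time() - start:.3f}s')

# javascript: a single round-trip for the whole column
start = time.time()
col_js = driver.execute_script(
    "return Array.from(document.querySelectorAll('tbody>tr>td:nth-child(1)'))"
    ".map(td => td.innerText);")
print(f'js time:    {time.time() - start:.3f}s')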
Here is the full code:
import math, time, pandas
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException

chromedriver_path = '...'
driver = webdriver.Chrome(service=Service(chromedriver_path))
wait = WebDriverWait(driver, 9)
driver.get('https://232app.azurewebsites.net/')

dropdown = wait.until(EC.element_to_be_clickable((By.NAME, 'erdashboard_length')))
target_number_of_rows = 100
Select(dropdown).select_by_value(str(target_number_of_rows))

# wait until 100 rows are loaded
current_number_of_rows = 0
while current_number_of_rows != target_number_of_rows:
    current_number_of_rows = len(driver.find_elements(By.CSS_SELECTOR, 'tbody tr'))

header = [th.text for th in driver.find_elements(By.XPATH, '//tr/th[position()<last()]')]
data = {key: [] for key in header}
number_of_pages = int(driver.find_element(By.CSS_SELECTOR, '.paginate_button:last-child').text)
times = []

while 1:
    start = time.time()
    if len(times) > 0:
        current_page = int(driver.find_element(By.CLASS_NAME, "current").text)
        mean = sum(times) / len(times)
        eta = (number_of_pages - current_page) * mean
        minutes = math.floor(eta / 60)
        seconds = round((eta / 60 - minutes) * 60)
        print(f'current page {current_page} (ETA {minutes}:{seconds}) (mean per page {mean:.2f}s) ({len(data[header[0]])} rows scraped)', end='\r')

    # scrape the table column by column with javascript (much faster than .text)
    for idx in range(1, 8):
        data[header[idx - 1]] += driver.execute_script(
            "var result = [];" +
            f"var all = document.querySelectorAll('tbody>tr>td:nth-child({idx})');" +
            "for (var i=0, max=all.length; i < max; i++) {" +
            "    result.push(all[i].innerText);" +
            "} " +
            "return result;")

    # check if all lists in the dictionary have the same length;
    # if not, there is a problem (column missing or not scraped properly)
    lens = [len(data[h]) for h in header]
    if len(set(lens)) != 1:
        print('\nerror: lists in the dictionary have different lengths')
        print(lens)
        break

    # click the next page button if available
    next_btn = driver.find_element(By.ID, 'erdashboard_next')
    if 'disabled' not in next_btn.get_attribute('class'):
        next_btn.click()
    else:
        print('\nno more pages to load')
        break

    # wait until the new page is loaded, i.e. until the ID in the first cell changes
    first_row_id_old = WebDriverWait(driver, 9).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
    first_row_id_new = first_row_id_old
    while first_row_id_new == first_row_id_old:
        try:
            first_row_id_new = WebDriverWait(driver, 9).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'tbody>tr>td'))).text
        except StaleElementReferenceException:
            continue

    times += [time.time() - start]
While the loop is running you get an output like this ("ETA" is the estimated remaining time in the format minutes:seconds, and "mean per page" is the mean time it takes to execute each loop iteration):
current page 156 (ETA 73:58) (mean per page 1.52s) (15500 rows scraped)
Then, running pandas.DataFrame(data) turns the dictionary into a DataFrame with one column per table header.
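For example, to inspect and save the result (the filename is just an illustration):

import pandas

df = pandas.DataFrame(data)
print(df.head())  # first few scraped rows
df.to_csv('exclusion_requests.csv', index=False)  # illustrative filename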