I want to scrape a website, namely https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=,
using Selenium, but I am only able to scrape the first page, not the other pages.
Here is my Selenium code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='C:/Users/ptiwar34/Documents/chromedriver.exe', chrome_options=chromeOptions, desired_capabilities=chromeOptions.to_capabilities())
driver.get('https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=')
WebDriverWait(driver, 20).until(EC.staleness_of(driver.find_element_by_xpath("//td/a[text()='2']")))
driver.find_element_by_xpath("//td/a[text()='2']").click()
numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td/a[text()='2']"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scraping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//td/a[text()='2']/span//following::span[1]"))).click()
driver.quit()
Here is the HTML content:
<td><span>1</span></td>
<td><a href="javascript:__doPostBack('dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView','Page$2')" style="color:#333333;">2</a></td>
This throws an error:
raise TimeoutException(message, screen, stacktrace)
TimeoutException
To scrape the website https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=
using Selenium, you can use the following locator strategy:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=%27")
while True:
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]"))).click()
        print("Clicked for next page")
    except TimeoutException:
        print("No more pages")
        break
driver.quit()
Console Output:
Clicked for next page
Clicked for next page
Clicked for next page
.
.
.
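To actually collect results on each page (in place of the print placeholder), you can parse driver.page_source on every iteration before clicking the next page link. Here is a minimal sketch using only the standard-library HTML parser; the sample HTML string and cell values below are hypothetical stand-ins for the real results table, which the answer above identifies by an id containing UNSPSCSearch_gvDetailsSearchView:

```python
from html.parser import HTMLParser

class TableTextParser(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = []      # cells of the row currently being parsed
        self._in_td = False # whether we are inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

# Hypothetical sample; in the loop above you would feed driver.page_source instead.
sample = "<table><tr><td>51000000</td><td>Drugs</td></tr></table>"
parser = TableTextParser()
parser.feed(sample)
print(parser.rows)  # → [['51000000', 'Drugs']]
```

Inside the while loop you would instantiate a fresh parser per page, feed it driver.page_source, and accumulate parser.rows before triggering the next-page click.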
Explanation: If you observe the HTML DOM, the page numbers are within a <table> with a dynamic id attribute containing the text UNSPSCSearch_gvDetailsSearchView. Further, the page numbers are within the last <tr>, which has a child <table>. Within the child table, the current page number is within a <span>, which holds the key. So to click() on the next page number you just need to identify the following <a> tag with index [1]. Finally, as the element has a javascript:__doPostBack() href, you have to induce WebDriverWait for the desired element_to_be_clickable().
You can find a detailed discussion in How do I wait for a JavaScript __doPostBack call through Selenium and WebDriver
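As an alternative to clicking the <a> element, note that its href is a javascript:__doPostBack(target, argument) call, so you can extract the two arguments and trigger the postback yourself via execute_script. The parsing helper below is a sketch (the function name parse_dopostback is mine, not from any library); only the commented-out line actually requires a live driver:

```python
import re

def parse_dopostback(href):
    """Extract the (target, argument) pair from a javascript:__doPostBack(...) href."""
    m = re.search(r"__doPostBack\s*\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)", href)
    return (m.group(1), m.group(2)) if m else None

# The href from the question's HTML snippet:
href = "javascript:__doPostBack('dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView','Page$2')"
target, argument = parse_dopostback(href)
print(target, argument)
# → dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView Page$2

# With a live driver you could then fire the postback directly instead of clicking:
# driver.execute_script("__doPostBack(arguments[0], arguments[1]);", target, argument)
```

This avoids stale-element issues when the pager is re-rendered after each postback, at the cost of bypassing the visibility checks that element_to_be_clickable() gives you.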