I am trying to scrape https://mst.dk/publikationer. The page is paginated and, looking at the source, the pagination appears to happen in the section I've added below.
I've tried multiple approaches: adding page=x to the URL, using different Selenium locators and selectors, increasing the wait time, clicking the next button, and imitating a click on the list items. Nothing seems to be working for me. Can anybody please help me figure out the dynamics of this page and how to paginate through it? What I am trying to do is open each link on each page, find the PDF and download it. This works fine for the first page, using the code below:
def parse_epa_filtered_keywords():
    # Get number of search results (10 results per page)
    page_no = int(int(get_number_of_results(link_filtered)) / 10) + 1
    driver = webdriver.Chrome(options=options)
    search_query = '+'.join(keywords.split())
    for i in tqdm(range(1, page_no + 1)):
        search_url = f"{link_filtered}?search={search_query}&page={i}"
        print(f"Fetching URL: {search_url}")
        # Load the search URL
        driver.get(search_url)
        # Wait for the page to load completely
        time.sleep(5)  # Adjust the sleep time as needed
        # Collect the result links on the current page
        publications = driver.find_elements(By.CSS_SELECTOR, 'a[class^="Link_Link__lzynb SearchResultItem_SearchResult"]')
This is the attempt that uses the page parameter in the URL, and it keeps opening the first page over and over. Then I tried to locate the next button with the following:
next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")
next_button = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.Pagination_Pagination_next_N6tkt a")))
and many more attempts with different elements, which either led to a general ChromeDriver error or to something like:
An error occurred: Message: element click intercepted: Element is not clickable at point (732, 2911)
(Session info: chrome=128.0.6613.114)
next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")
The XPath expression in your code above is correct, but a plain click on the element fails: the "element click intercepted" error means another element would receive the click at that point. I used ActionChains as below, and it successfully clicked the next button.
next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
actions = ActionChains(driver)
actions.move_to_element(next_button).click().perform()
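One caveat when clicking through pages: if the result list is re-rendered in place rather than via a full page load (an assumption about this site, not something verified here), the old results may still be in the DOM right after the click. A minimal sketch of a custom wait condition that guards against this is shown below. The helper name `text_changed` is hypothetical; it relies only on the fact that `WebDriverWait.until()` accepts any callable that takes the driver.

```python
def text_changed(locator, previous_text):
    """Build a wait condition that becomes true once the text of the
    first element matching `locator` differs from `previous_text`.

    `locator` is a (by, value) tuple such as (By.XPATH, "//li//h3").
    """
    def _condition(driver):
        # find_element returns the first match; compare its text to the old one
        current = driver.find_element(*locator).text
        return current != previous_text
    return _condition
```

A possible usage: read `first = driver.find_element(By.XPATH, "//li//h3").text` before clicking the next button, then call `wait.until(text_changed((By.XPATH, "//li//h3"), first))` before scraping the new page. This assumes the first heading actually differs between consecutive pages.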
Here is full working code that scrapes the pages in a loop.
Note: I am scraping only the first 3 pages and only the search result headings; you can scrape whatever you want:
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def click_next_page():
    # Move to the "Next page" link and click it via ActionChains
    next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
    actions = ActionChains(driver)
    actions.move_to_element(next_button).click().perform()


def extract_headings(wait):
    # Collect the text of every search result heading on the current page
    headings = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//li//h3")))
    search_results_headings = ""
    for heading in headings:
        search_results_headings += "\n" + heading.text
    return search_results_headings


driver = webdriver.Chrome()
driver.get("https://mst.dk/publikationer")
wait = WebDriverWait(driver, 10)

# Use the two lines below only if you see the accept/reject cookies pop-up
accept_all = wait.until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")))
driver.execute_script("arguments[0].click();", accept_all)

search_results_headings = ""
# The loop below iterates 3 times, so 3 pages are scraped; change the range to scrape more pages
for _ in range(3):
    search_results_headings += extract_headings(wait)
    click_next_page()

print(search_results_headings)
Console output:

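For the original goal of downloading the PDFs, the same loop can collect links instead of headings. The sketch below only covers the filtering step and uses the standard library. It assumes the PDF links have a URL path ending in `.pdf`, which has not been checked against this site, and the helper name `pdf_links` is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def pdf_links(base_url, hrefs):
    """Return absolute, de-duplicated URLs from `hrefs` whose path ends in .pdf."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            # get_attribute("href") can return None for anchors without an href
            continue
        absolute = urljoin(base_url, href)
        # Check only the path so query strings like ?download=1 do not matter
        if urlparse(absolute).path.lower().endswith(".pdf") and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

With Selenium, the input could come from something like `[a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]` on each publication page, with `driver.current_url` as the base URL.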