I am trying to scrape this webpage, https://mst.dk/publikationer. It is paginated, and judging by the page source, the pagination is handled by the section I've added below.
<div class="Container_Container__G5vVd Container_Container___width_std__y2_Pn">
<div class="Pagination_Pagination_wrapper__kp62j">
<ul class="Pagination_Pagination__UOZ60" role="navigation" aria-label="Pagination">
<li class="Pagination_Pagination_prev__zIUqn Pagination_Pagination_item___disabled__g5CaR">
<a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_prevLink__HDKS4" tabindex="-1" role="button" aria-disabled="true" aria-label="Previous page" rel="prev"></a>
</li>
<li class="Pagination_Pagination_item__suqyV selected">
<a rel="canonical" role="button" class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_link___active__to_Os" tabindex="-1" aria-label="Side 1" aria-current="page">1</a>
</li>
<li class="Pagination_Pagination_item__suqyV">
<a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 2" rel="next">2</a>
</li>
<li class="Pagination_Pagination_break__dKVzB">
<a class="Pagination_Pagination_breakLink__jB8Rd" role="button" tabindex="0">...</a>
</li>
<li class="Pagination_Pagination_item__suqyV">
<a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 321">321</a>
</li>
<li class="Pagination_Pagination_next__N6tkt">
<a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_nextLink__mytrA" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"></a>
</li>
</ul>
</div>
I've tried multiple approaches: adding page=x to the URL, using different Selenium locators and selectors, increasing the wait time, clicking the next button, and imitating a click on the list items. Nothing seems to be working for me. Can anybody please help me figure out the dynamics of this page and how to paginate through it? What I am trying to do is open each link on each page, find the PDF and download it. This works fine for the first page, using the code below:
def parse_epa_filtered_keywords():
    # Get number of search results
    page_no = int(int(get_number_of_results(link_filtered)) / 10) + 1
    driver = webdriver.Chrome(options=options)
    search_query = '+'.join(keywords.split())
    for i in tqdm(range(1, page_no + 1)):
        try:
            search_url = f"{link_filtered}?search={search_query}&page={i}"
            print(f"Fetching URL: {search_url}")
            # Load the search URL
            driver.get(search_url)
            # Wait for the page to load completely
            time.sleep(5)  # Adjust the sleep time as needed
            # Wait for the main page to load again
            publications = driver.find_elements(By.CSS_SELECTOR, 'a[class^="Link_Link__lzynb SearchResultItem_SearchResult"]')
            ....
    driver.quit()
This is the attempt that uses the page parameter in the URL, and it keeps opening the first page over and over. Then I tried to locate the next button with the following:
next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")
or
next_button = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.Pagination_Pagination_next_N6tkt a")))
and many more attempts with different elements, which led either to a generic ChromeDriver error or to something like:
An error occurred: Message: element click intercepted: Element is not clickable at point (732, 2911)
(Session info: chrome=128.0.6613.114)
Stacktrace:
0 chromedriver 0x0000000104f83998 cxxbridge1$str$ptr + 1887096
1 chromedriver 0x0000000104f7be00 cxxbridge1$str$ptr + 1855456
2 chromedriver 0x0000000104b80be0 cxxbridge1$string$len + 89508
3 chromedriver 0x0000000104bca6fc cxxbridge1$string$len + 391360
4 chromedriver 0x0000000104bc8d28 cxxbridge1$string$len + 384748
5 chromedriver
next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")
The XPath expression in your code above is correct, but the plain click does not reach the element: "element click intercepted" means that another element is covering the button at the click point. I used ActionChains as below and it successfully clicked the next button.
next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
actions = ActionChains(driver)
actions.move_to_element(next_button).click().perform()
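If ActionChains also fails in your setup, a JavaScript click may be worth trying, similar to the one used for the cookie button in the full code. The helper below is only a sketch: `js_click` is a hypothetical name, and it assumes that the page's click handler reacts to a scripted click, which is not verified for this site.

```python
def js_click(driver, element):
    """Scroll `element` into view, then click it through JavaScript.

    `driver` is a Selenium WebDriver and `element` a WebElement; only
    `driver.execute_script` is used.
    """
    # Centre the element so sticky headers or banners are less likely to cover it
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    # Trigger the click from inside the page instead of simulating a mouse click
    driver.execute_script("arguments[0].click();", element)


# Possible usage inside click_next_page(), replacing the ActionChains lines:
# next_button = wait.until(EC.presence_of_element_located((By.XPATH, "//a[@aria-label='Next page']")))
# js_click(driver, next_button)
```

A scripted click does not check whether the element is covered, so it gets past the "click intercepted" error, but it can also click something a real user could not reach.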
Here is full working code that scrapes the pages in a loop.
Note: I am scraping only the first 3 pages and collecting the search result headings; you can adapt this to scrape whatever you need:
from selenium.webdriver import ActionChains
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
def click_next_page():
    next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
    actions = ActionChains(driver)
    actions.move_to_element(next_button).click().perform()

def extract_headings(wait):
    headings = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//li//h3")))
    search_results_headings = ""
    for heading in headings:
        search_results_headings += "\n" + heading.text
    return search_results_headings

driver = webdriver.Chrome()
driver.get("https://mst.dk/publikationer")
driver.maximize_window()
wait = WebDriverWait(driver, 10)
# Use the two lines below only if the accept/reject cookies pop-up appears
accept_all = wait.until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")))
driver.execute_script("arguments[0].click();", accept_all)
search_results_headings = ""
# The loop below runs 3 times, so 3 pages are scraped; change the range to scrape more pages
for _ in range(3):
    search_results_headings += extract_headings(wait)
    click_next_page()
print(search_results_headings)
Console output:
Diffus forurening med PFAS i jord, grundvand og overfladevand
Digitale værktøjer til klimatilpasning
Performancebenchmarking
Oprensning af PFAS-forurening i jord, slam og vand - Test af teknologier i praksis
Lokalt funderede analyse – afrapportering
Maritime Emissionsløsninger i Kystnære Farvande
Biokinetisk lattergasreduktion i renseanlæg
Inter DAN NRW
Gennemførelse og anvendelse af slamdirektivet 2023
CombiControl - Combining above- and belowground biological control agents for improved pest control in strawberry tunnel production
Affaldsstatistik 2022
Scientific investigation of ballast water discharge - Random checks on ships in autumn – winter 2022
Control of Biocides 2023
Ny kosteffektiv teknologi til måling af klimagasudledninger fra renseanlæg
Recycling potential of separately collected post-consumer textile waste
Modelling and mapping pesticide exposure risk at the catchment scale (MOMAPEST)
Indberetning af status for anvendelse af almene vandforsyningsboringer i Virk.dk
PFAS i jord - International screening af andre landes praksis for håndtering af jord med PFAS
Anbefalinger til screening og kortlægning af bygge- og anlægsaffald
Emissions of Quaternary Alkylammonium Compounds
Nikotinposer – indhold og miljøkonsekvenser
Udredningsprojekt vedr. analysemetoder til undersøgelse for PFAS-forbindelser i jord, grundvand og overfladevand
Rensningsmuligheder for pesticider med fokus på aktivt kul og membraner
Renholds- og omkostningsanalyse jf. Engangsplastdirektivets oprydningsansvar
Kemiske stoffer i en cirkulær økonomi - Et MUDP projekt
Pesticider og biocider i den danske pindsvinebestand
Kortlægning af madaffald i primærproduktionen samt forarbejdnings- og fremstillingssektoren for 2022
Kortlægning af madaffald og madspild i restaurationsbranchen og restaurationstjenester for 2022
Inhibition of lung surfactant function as an alternative method to predict lung toxicity following exposure to plant protection products
Survey and risk assessment of pesticides in cut flowers from non-EU countries
Process finished with exit code 0
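To scrape all pages instead of a fixed 3, the last page number could be read from the pagination markup shown in the question, where each page link carries an aria-label such as "Side 321". The sketch below uses only the standard library and assumes that label format stays the same; `PageLabelParser` and `last_page_number` are hypothetical names, and the sample string is a shortened copy of the question's HTML.

```python
import re
from html.parser import HTMLParser


class PageLabelParser(HTMLParser):
    """Collect page numbers from <a aria-label="Side N"> pagination links."""

    def __init__(self):
        super().__init__()
        self.page_numbers = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Attributes without a value come back as None, hence the `or ""`
        label = dict(attrs).get("aria-label") or ""
        match = re.fullmatch(r"Side (\d+)", label)
        if match:
            self.page_numbers.append(int(match.group(1)))


def last_page_number(html):
    """Return the highest page number found in the markup, or 1 if none is found."""
    parser = PageLabelParser()
    parser.feed(html)
    return max(parser.page_numbers, default=1)


# Shortened copy of the pagination markup from the question
sample = """
<ul role="navigation" aria-label="Pagination">
  <li><a role="button" aria-label="Side 1" aria-current="page">1</a></li>
  <li><a role="button" aria-label="Side 2" rel="next">2</a></li>
  <li><a role="button">...</a></li>
  <li><a role="button" aria-label="Side 321">321</a></li>
</ul>
"""
print(last_page_number(sample))  # 321
```

In the loop above, `range(3)` could then become `range(last_page_number(driver.page_source))`, read once after the first page has loaded.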