I'm trying to scrape URLs from a dynamically loaded webpage that requires continuous scrolling to load all the content into the DOM. My approach is to run window.scrollTo(0, document.body.scrollHeight); in a loop using Selenium's execute_script function. After each scroll, I compare the number of URLs loaded before and after the scroll. If the number doesn't change, I assume the end of the page has been reached and break the loop.
However, the script concludes that all content has been loaded into the DOM, even though I know there is more content that loads within the given timeout. Below is my code:
    def _scroll_page_to_bottom(self, timeout: int):  # Todo: Fix Bugs
        while True:
            urls_before_scroll = self.browser.find_elements(
                By.XPATH, read_xpath(self.scrape_programs_urls.__name__, "programs_urls")
            )
            self.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Wait for new content to be loaded
            try:
                WebDriverWait(self.browser, timeout).until(
                    lambda _: len(self.browser.find_elements(
                        By.XPATH, read_xpath(self.scrape_programs_urls.__name__, "programs_urls"))
                    ) > len(urls_before_scroll)
                )
            except TimeoutException:
                # If no new content is loaded within the timeout, assume we've reached the end of the page
                break
Can anyone suggest what could be causing the issue in the above code?
Edit: I did some debugging and found that the issue is specifically with the scroll itself. When I execute window.scrollTo(0, document.body.scrollHeight); in the browser console, the page doesn't scroll to the bottom either, which explains why my code isn't working. The site I'm trying to scrape is https://hackerone.com/opportunities/all/search
The code below scrolls the page down correctly. Instead of scrolling the window, it scrolls the pane's content element into view. Try embedding it into your code:
    ele = driver.find_element(By.XPATH, '//div[contains(@class,"Pane-module_u1-pane__content")]')
    driver.execute_script('arguments[0].scrollIntoView(false);', ele)
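
For what it's worth, here is a rough sketch of how that scroll could be folded into the "scroll, then wait for more URLs" loop from the question. The function name scroll_pane_to_bottom and the parameters item_xpath, poll, and pane_xpath are made up for illustration; they are not from the original code. To keep the sketch free of third-party imports, it polls with time.sleep instead of WebDriverWait and passes the locator strategy as the plain string "xpath", which is the value of By.XPATH. The pane XPath is the one from the snippet above and may stop matching if the site changes its class names.

```python
import time

# XPath taken from the answer above; it may break if the site renames its classes.
PANE_XPATH = '//div[contains(@class,"Pane-module_u1-pane__content")]'


def scroll_pane_to_bottom(driver, item_xpath, timeout=10.0, poll=0.5,
                          pane_xpath=PANE_XPATH):
    """Scroll the inner pane until no new items show up within `timeout` seconds.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    find_element, find_elements and execute_script works).
    Returns the number of items found once scrolling stops.
    """
    count = len(driver.find_elements("xpath", item_xpath))
    while True:
        # Look the pane up again on every pass in case the DOM was re-rendered.
        pane = driver.find_element("xpath", pane_xpath)
        # Align the bottom of the pane content with the bottom of the visible area.
        driver.execute_script("arguments[0].scrollIntoView(false);", pane)

        # Poll until more items appear or the timeout expires.
        deadline = time.monotonic() + timeout
        new_count = count
        while time.monotonic() < deadline:
            new_count = len(driver.find_elements("xpath", item_xpath))
            if new_count > count:
                break
            time.sleep(poll)

        if new_count <= count:
            # Nothing new was loaded in time: assume the end of the list.
            return count
        count = new_count
```

Inside the question's class, this could be called as scroll_pane_to_bottom(self.browser, read_xpath(self.scrape_programs_urls.__name__, "programs_urls"), timeout), assuming read_xpath returns the XPath string for the program links.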