python selenium dom web-scraping pageloadstrategy

Selenium download entire html

I have been trying to use selenium to scrape and entire web page. I expect at least a handful of them are spa's such as Angular, React, Vue so that is why I am using Selenium.

I need to download the entire page (if some content isn't loaded from lazy loading because of not scrolling down that is fine). I have tried setting a time.sleep() delay, but that has not worked. After I get the page I am looking to hash it and store it in a db to compare later and check to see if the content has changed. Currently the hash is different every time and that is because selenium is not downloading the entire page, each time a different partial amount is missing. I have confirmed this on several web pages not just a singular one.

I also have probably a 1000+ web pages to go through by hand just getting all the links so I do not have time to find an element on them to make sure it is loaded.

How long this process takes is not important. If it takes 1+ hours so be it, speed is not important only accuracy.

If you have an alternative idea please also share.

My driver declaration

 from selenium import webdriver
 from selenium.common.exceptions import WebDriverException

 driverPath = '/usr/lib/chromium-browser/chromedriver'

 def create_web_driver():
     options = webdriver.ChromeOptions()
     options.add_argument('headless')

     # set the window size
     options.add_argument('window-size=1200x600')

     # try to initalize the driver
     try:
         driver = webdriver.Chrome(executable_path=driverPath, chrome_options=options)
     except WebDriverException:
         print("failed to start driver at path: " + driverPath)

     return driver

My url call my timeout = 20

 driver.get(url)
 time.sleep(timeout)
 content = driver.page_source

 content = content.encode('utf-8')
 hashed_content = hashlib.sha512(content).hexdigest()

^ getting different hash here every time since same url not producing same web page

Solution

As the Application Under Test(AUT) is based on Angular, React, Vue in that case Selenium seems to be the perfect choice.

Now, as you are fine with the fact that some content isn't loaded from lazy loading because of not scrolling makes the usecase feasible. But in all possible ways ...do not have time to find an element on them to make sure it is loaded... can't be really compensated inducing time.sleep() as time.sleep() have certain drawbacks. You can find a detailed discussion in How to sleep webdriver in python for milliseconds. It would be worth to mention that the state of the HTML DOM will be different for all the 1000 odd web pages.

Solution

A couple of viable solutions:

A pottential solution could have been to induce WebDriverWait and ensure that some HTML elements are loaded as per the discussion How can I make sure if some HTML elements are loaded for Selenium + Python? validating atleast either of the following:
- Page Title
- Page Heading
Another solution would be to tweak the capability pageLoadStrategy. You can set the pageLoadStrategy for all the 1000 odd web pages to common point assigning a value either:
- normal (full page load)
- eager (interactive)
- none
You can find a detailed discussion in How to make Selenium not wait till full page load, which has a slow script?

If you implement pageLoadStrategy, page_source method will be triggered at the same tripping point and possibly you would see identical hashed_content.