Tags: ruby, selenium-webdriver, web-scraping, nokogiri, scraper

How to extract JS rendered HTML using Selenium-webdriver and nokogiri?


Consider two webpages, one and two. Site two is easy to scrape with Nokogiri because it doesn't rely on JavaScript. Site one, however, cannot be scraped with Nokogiri alone. After a lot of searching, I found that if I load the page in an automated web browser, I can scrape the rendered HTML. This is the code I have so far:

require 'selenium-webdriver'

# create a browser instance
driver = Selenium::WebDriver.for :chrome

# open the page to scrape
driver.get 'http://www.bigstub.com/search.aspx'

# wait object used to give the page time to load and the JS time to render
wait = Selenium::WebDriver::Wait.new(:timeout => 5)

My question: I want to let the page load and then close the browser as soon as the class I'm after appears. For example, if I raise the timeout to 10 seconds, how do I write code that waits only until the class .title-holder can be found?

Pseudocode: stop waiting on rendered_source_page as soon as .include?("title-holder") is true. I just don't know how to write it.


Solution

  • On the headless question: Selenium lets you run the browser headless by adding an argument to the Chrome options:

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    driver = Selenium::WebDriver.for :chrome, options: options
    

    To scrape the fully JS-rendered HTML, I set the wait timeout to 5 seconds and waited for the class name to appear in the page source:

    wait = Selenium::WebDriver::Wait.new(:timeout => 5)
    wait.until { /title-holder/.match(driver.page_source) }
    

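    wait.until re-runs the block until it returns a truthy value, here until title-holder appears in page_source (the rendered HTML), and gives up with a timeout error after 5 seconds. It returns as soon as the match succeeds, so it does not always wait the full 5 seconds. This solved all my questions.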
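
For anyone who wants the pieces in one place, here is a minimal sketch of how they might fit together: wait for the class, grab the rendered HTML, close the browser, and hand the HTML to Nokogiri. The helper names fetch_rendered_html and scrape_titles are hypothetical, the 10-second timeout comes from the example in the question, and waiting on find_elements with a CSS selector (instead of a regex over page_source) is my own substitution, not something the original post used.

```ruby
# Hypothetical helpers for illustration; the names are not part of
# Selenium or Nokogiri.

# Loads `url`, blocks until `css` matches at least one element, returns the
# rendered HTML, and always closes the browser (even if the wait times out).
def fetch_rendered_html(driver, wait, url, css)
  driver.get(url)
  wait.until { driver.find_elements(css: css).any? }
  driver.page_source
ensure
  driver.quit
end

# End-to-end usage. The requires live inside the method so that
# fetch_rendered_html can be loaded without the gems or a browser installed.
def scrape_titles(url = 'http://www.bigstub.com/search.aspx', css = '.title-holder')
  require 'selenium-webdriver'
  require 'nokogiri'

  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  driver = Selenium::WebDriver.for(:chrome, options: options)
  wait = Selenium::WebDriver::Wait.new(timeout: 10)

  html = fetch_rendered_html(driver, wait, url, css)
  # Parse the rendered HTML and return the text of each matching node.
  Nokogiri::HTML(html).css(css).map { |node| node.text.strip }
end
```

Calling scrape_titles from a script should return the text of every .title-holder node, assuming Chrome and a matching chromedriver are available. Checking for the element with a CSS selector avoids a false match when the string title-holder shows up somewhere else in the source, such as inside a script or stylesheet.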