Consider two webpages, one and two. Site two is easy to scrape with nokogiri because it doesn't rely on JS. Site one, however, cannot be scraped with nokogiri alone. After a lot of searching, I found that if I load the page in an automated web browser, I can scrape the rendered HTML. I have the following code:
require 'selenium-webdriver'
# starts a Chrome session
driver = Selenium::WebDriver.for :chrome
# opens an existing webpage
driver.get 'http://www.bigstub.com/search.aspx'
# creates a wait object with a 5 second timeout, to give the JS time to render
wait = Selenium::WebDriver::Wait.new(:timeout => 5)
My question: I want to let the page load and then close it as soon as the class I am looking for shows up. For example, with the timeout set to 10 seconds, the wait should end as soon as the class .title-holder can be found. How would I write this code?
Pseudocode:
rendered_source_page will time out if .include?("title-holder")
I just don't know how to write it.
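The pseudocode above amounts to a polling loop: keep checking a condition until it becomes true or a deadline passes. A minimal pure-Ruby sketch of that idea follows. Note that `poll_until`, its `interval` parameter, and the `PollTimeout` error are hypothetical names made up for illustration and are not part of Selenium.

```ruby
# Hypothetical error raised when the condition never becomes truthy in time.
class PollTimeout < StandardError; end

# Calls the block repeatedly until it returns a truthy value, then returns
# that value. Raises PollTimeout once `timeout` seconds have elapsed.
def poll_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise PollTimeout, "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end
```

With a driver in hand, the condition would be something like `poll_until(timeout: 10) { driver.page_source.include?('title-holder') }`.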
UPDATE: Regarding the headless question, Selenium has an option for running the browser headless. It is set with the code below:
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for :chrome, options: options
As for my main question: to give the JS time to render before scraping the HTML, I set the wait timeout to 5 seconds:
wait = Selenium::WebDriver::Wait.new(:timeout => 5)
wait.until { /title-holder/.match(driver.page_source) }
wait.until means: keep checking, for up to 5 seconds, until a title-holder class shows up in the page_source (the rendered HTML). If it never shows up, a timeout error is raised. This pretty much solved all my questions.
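Matching a regex against `page_source` works, but it would also match the string `title-holder` anywhere in the markup, for example inside a script. A possible variant waits for an element with that class instead, and always closes the browser afterwards. This is only a sketch: `rendered_source` is a hypothetical helper, and it takes the driver and wait objects as arguments so that it assumes nothing about how they were built.

```ruby
# Hypothetical helper: loads `url`, blocks until at least one element matches
# the CSS `selector`, and returns the rendered HTML. The browser is closed
# even if the wait times out.
def rendered_source(driver, wait, url, selector)
  driver.get(url)
  # find_elements returns an empty array (rather than raising) when nothing
  # matches, so the block stays falsy until the element shows up.
  wait.until { driver.find_elements(css: selector).any? }
  driver.page_source
ensure
  driver.quit
end

# Intended usage with the objects from the snippets above (assumed setup):
#   driver = Selenium::WebDriver.for :chrome, options: options
#   wait   = Selenium::WebDriver::Wait.new(timeout: 10)
#   html   = rendered_source(driver, wait,
#                            'http://www.bigstub.com/search.aspx', '.title-holder')
```

The returned string can then be handed to nokogiri, the same way as for the site that doesn't use JS.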