ruby web-scraping nokogiri mechanize

Scrape a URL for data that is loaded with JavaScript using Ruby


I am trying to scrape this page for rental listings using a Ruby script. I have tried Nokogiri and Mechanize without success: the initial page only contains 14 listings, and the rest are loaded through what I presume is embedded JavaScript. I have also looked briefly at rkelly, but reading through its available classes got me nowhere.

Here is what I have so far:

# First attempt: only returned 14 results
require 'mechanize'
require 'nokogiri'
require 'open-uri'
require 'json' # needed for to_json

url = "http://streeteasy.com/for-rent/soho/"

# Fetch the static HTML; listings injected later by JavaScript are not in it.
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
page = Nokogiri::HTML(URI.open(url))

# agent = Mechanize.new
# agent.get(url)
# pp signin_page = agent.page.link_with(:text => 'Sign In').click
# # pp signin_page.forms

listing_nodes = page.css('.item_inner')

# Build one hash per listing, keeping only digits and dots in the price
listings = listing_nodes.map do |listing|
  {
    address: listing.css("div.details_title a").first.inner_html,
    price: listing.css("span.price").inner_html.gsub(/[^0-9.]/, '')
  }
end

# Sort ascending by price and keep the 20 most expensive listings
top_listings = listings.sort_by { |listing| listing[:price].to_i }.last(20)

puts top_listings.to_json
puts "There are #{top_listings.length} listings"

There is also an XLS file that the listings can be exported to, but you need to be logged in and the sign-in is a JavaScript modal, so I'm really reaching a sticking point here. What would be the best way to approach this problem?


Solution

  • What I managed to do was use Watir, a Ruby wrapper for Selenium, to open the page in a browser and then pass the loaded HTML into Nokogiri for parsing.