Tags: ruby-on-rails, ruby, web-scraping, nokogiri, httparty

Why is my Ruby function web scraping the first page only and not the paginated pages?


I am using Nokogiri to scrape all vehicles across about 14 pages of a dealership website. The bug I am encountering is that my code runs the scraper 14 times, but only ever on the first page. What is wrong with my code?

As you can see from the output, the same vehicles are scraped over and over again instead of a new set of vehicles from each subsequent page.

Ruby version: 2.6.2

scraper.rb:

    require 'nokogiri'
    require 'httparty'
    require 'byebug'
    
    def scraper
        url = "https://www.example.com/new-vehicles/"
        unparsed_page = HTTParty.get(url)
        parsed_page = Nokogiri::HTML(unparsed_page)
        vehicles = Array.new
        vehicle_listings = parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
        page = 1
        per_page = vehicle_listings.count  #20
        total = parsed_page.css('span.count').text.to_i #281
        last_page = (total.to_f / per_page.to_f).ceil #14
        while page <= last_page
            pagination_url = "https://www.example.com/new-vehicles/#action=im_ajax_call&perform=get_results&page=#{page}"
            pagination_unparsed_page = HTTParty.get(pagination_url)
            puts pagination_url
            puts "Page: #{page}"
            puts ''
            pagination_parsed_page = Nokogiri::HTML(pagination_unparsed_page)
            pagination_vehicle_listings = pagination_parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
            pagination_vehicle_listings.each do |vehicle_listing|
                vehicle = {
                    title: vehicle_listing.css('h2')&.text&.gsub("New", '').gsub("2021", '').gsub("With", '').gsub("Navigation", ''),
                    price: vehicle_listing.css('span.price')[0]&.text&.delete('^0-9').to_i,
                    stock_number: vehicle_listing.css('.stock-label')&.text.gsub("Stock #: ", ''),
                    exterior_color: vehicle_listing.css('span.detail-content')[3]&.text,
                    interior_color: vehicle_listing.css('span.detail-content')[4]&.text&.delete('0-9').gsub('MPG', 'unavailable')
                }
                vehicles << vehicle
                puts "Added #{vehicle[:stock_number]}"
                puts ""
            end
            page += 1
        end
        byebug
    end

    scraper

Output:

    Page: 1

    Added vehicle with stock#: 218864

    Added vehicle with stock#: 218865

    Added vehicle with stock#: 218604

    https://www.example.com/new-vehicles/#action=im_ajax_call&perform=get_results&page=2

    Page: 2

    Added vehicle with stock#: 218864

    Added vehicle with stock#: 218865

    Added vehicle with stock#: 218604


Solution

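  • The page number sits in the URL fragment (everything after the # character), which an HTTP client never sends to the server, so every HTTParty request fetched the same first page. I was able to scrape the data I needed by using Watir with a headless Chrome browser, which allows the site's AJAX/JavaScript code to run before the page is parsed.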

        require 'nokogiri'
        require 'httparty'
        require 'byebug'
        require 'watir'
        
        def scraper
            url = "https://www.example.com/new-vehicles/"
            unparsed_page = HTTParty.get(url)
            parsed_page = Nokogiri::HTML(unparsed_page)
            vehicles = Array.new
            vehicle_listings = parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
            page = 1
            per_page = vehicle_listings.count  #20
            total = parsed_page.css('span.count').text.to_i #281
            last_page = (total.to_f / per_page.to_f).ceil #14
            
            # Create instance of headless chrome called browser
            browser = Watir::Browser.new :chrome, headless: true
            
            while page <= last_page
                pagination_url = "https://www.example.com/new-vehicles/#action=im_ajax_call&perform=get_results&page=#{page}"
                browser.goto(pagination_url)
                puts pagination_url
                puts "Page: #{page}"
                puts ''
                # Parse the HTML after the browser has run the page's JavaScript
                pagination_parsed_page = Nokogiri::HTML(browser.html)
                pagination_vehicle_listings = pagination_parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
                pagination_vehicle_listings.each do |vehicle_listing|
                    vehicle = {
                        stock_number: vehicle_listing.css('.stock-label')&.text.gsub("Stock #: ", '')
                    }
                    vehicles << vehicle
                    puts "Added #{vehicle[:stock_number]}"
                    puts ""
                end
                page += 1
            end
            # Be sure to close the browser
            browser.close
            byebug
        end

        scraper
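
Why the original HTTParty loop kept returning page 1 can be checked without touching the network. The sketch below uses only Ruby's standard URI library. The helper name split_request_and_fragment is made up for illustration and is not part of HTTParty or Nokogiri. It assumes, as is standard for HTTP clients, that only the path and query are sent to the server, while everything after the # stays on the client.

```ruby
require 'uri'

# Hypothetical helper: split a URL into the part an HTTP client puts in
# the request line and the fragment, which is never sent to the server.
def split_request_and_fragment(url)
  uri = URI.parse(url)
  { sent_to_server: uri.request_uri, kept_by_client: uri.fragment }
end

base = 'https://www.example.com/new-vehicles/'

# Build the same paginated URLs as the question and show what each
# request would actually ask the server for.
(1..3).each do |page|
  url = "#{base}#action=im_ajax_call&perform=get_results&page=#{page}"
  parts = split_request_and_fragment(url)
  puts "page #{page}: server sees #{parts[:sent_to_server]}"
  puts "         fragment: #{parts[:kept_by_client]}"
end
```

All three URLs reduce to the same request for /new-vehicles/, which is consistent with the repeated listings in the question's output. A real browser, such as the headless Chrome driven by Watir above, keeps the fragment and lets the site's JavaScript act on it.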