I am using Nokogiri to scrape all vehicles across about 14 pages of a dealership website. The bug I am encountering is that my code runs the scraper 14 times, but only ever on the first page. What is wrong with my code?
As the output shows, the same vehicles are scraped over and over again instead of the new set of vehicles from each subsequent page.
Ruby version: 2.6.2
scraper.rb:
require 'nokogiri'
require 'httparty'
require 'byebug'

def scraper
  url = "https://www.example.com/new-vehicles/"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page)
  vehicles = Array.new
  vehicle_listings = parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
  page = 1
  per_page = vehicle_listings.count #20
  total = parsed_page.css('span.count').text.to_i #281
  last_page = (total.to_f / per_page.to_f).ceil #14
  while page <= last_page
    pagination_url = "https://www.example.com/new-vehicles/#action=im_ajax_call&perform=get_results&page=#{page}"
    pagination_unparsed_page = HTTParty.get(pagination_url)
    puts pagination_url
    puts "Page: #{page}"
    puts ''
    pagination_parsed_page = Nokogiri::HTML(pagination_unparsed_page)
    pagination_vehicle_listings = pagination_parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
    pagination_vehicle_listings.each do |vehicle_listing|
      vehicle = {
        title: vehicle_listing.css('h2')&.text&.gsub("New", '').gsub("2021", '').gsub("With", '').gsub("Navigation", ''),
        price: vehicle_listing.css('span.price')[0]&.text&.delete('^0-9').to_i,
        stock_number: vehicle_listing.css('.stock-label')&.text.gsub("Stock #: ", ''),
        exterior_color: vehicle_listing.css('span.detail-content')[3]&.text,
        interior_color: vehicle_listing.css('span.detail-content')[4]&.text&.delete('0-9').gsub('MPG', 'unavailable')
      }
      vehicles << vehicle
      puts "Added #{vehicle[:stock_number]}"
      puts ""
    end
    page += 1
  end
  byebug
end

scraper
Output:
Page: 1
Added vehicle with stock#: 218864
Added vehicle with stock#: 218865
Added vehicle with stock#: 218604
https://www.example.com/new-vehicles/#action=im_ajax_call&perform=get_results&page=2
Page: 2
Added vehicle with stock#: 218864
Added vehicle with stock#: 218865
Added vehicle with stock#: 218604
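I was able to scrape the data I needed by using Watir with a headless Chrome browser. Everything after the # in the pagination URL is a fragment, which is never sent to the server, so HTTParty received the same first page on every request. Watir allows the AJAX/JavaScript code to run before the page is scraped.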
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'watir'

def scraper
  url = "https://www.example.com/new-vehicles/"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page)
  vehicles = Array.new
  vehicle_listings = parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
  page = 1
  per_page = vehicle_listings.count #20
  total = parsed_page.css('span.count').text.to_i #281
  last_page = (total.to_f / per_page.to_f).ceil #14

  # Create an instance of headless Chrome called browser
  browser = Watir::Browser.new :chrome, headless: true

  while page <= last_page
    pagination_url = "https://www.example.com/new-vehicles/#action=im_ajax_call&perform=get_results&page=#{page}"
    # Load the URL in the browser so the page's JavaScript runs,
    # then parse the rendered HTML instead of the raw HTTP response
    browser.goto(pagination_url)
    pagination_parsed_page = Nokogiri::HTML(browser.html)
    puts pagination_url
    puts "Page: #{page}"
    puts ''
    pagination_vehicle_listings = pagination_parsed_page.css("//div[@class='vehicle list-view new-vehicle publish']") #20 cars
    pagination_vehicle_listings.each do |vehicle_listing|
      vehicle = {
        stock_number: vehicle_listing.css('.stock-label')&.text.gsub("Stock #: ", '')
      }
      vehicles << vehicle
      puts "Added #{vehicle[:stock_number]}"
      puts ""
    end
    page += 1
  end

  # Be sure to close the browser
  browser.close
  byebug
end

scraper
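
To check what the original HTTParty loop actually asked the server for, the pagination URLs can be split with Ruby's standard `uri` library. The part after `#` is a URL fragment, which HTTP clients keep locally and do not send, so every page number ends up requesting the same path. This is only a minimal sketch: the helper name `request_target` is made up for illustration, and it builds the first three pagination URLs from the question rather than making any network calls.

```ruby
require 'uri'

# Hypothetical helper (not part of HTTParty or Nokogiri): returns the path,
# plus the query string if there is one, that goes on the HTTP request line.
def request_target(url)
  URI.parse(url).request_uri
end

base = "https://www.example.com/new-vehicles/"

# Build the same pagination URLs the scraper generates for pages 1 to 3.
pages = (1..3).map do |page|
  "#{base}#action=im_ajax_call&perform=get_results&page=#{page}"
end

pages.each do |url|
  uri = URI.parse(url)
  puts "fragment (stays in the client): #{uri.fragment}"
  puts "request target (sent to server): #{request_target(url)}"
  puts ''
end

# All three URLs collapse to a single request target.
puts pages.map { |url| request_target(url) }.uniq.inspect
# => ["/new-vehicles/"]
```

The fragments differ per page but the request target never changes, which matches the repeated stock numbers in the output. A real browser, such as the headless Chrome driven by Watir, hands the fragment to the page's JavaScript, which is what fetches the next set of vehicles.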