ruby web-scraping nokogiri httparty open-uri

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape the following page to get all of the state statistics on coronavirus:

https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html

My code below works:

require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'

url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML.parse(URI.open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text

However, I am unable to get into the collapsed data/gridcell data.

I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.

Solution

Although Layon Ferreira's answer already identifies the problem, it does not provide the steps needed to load the data.

As already mentioned in the linked answer, the data is loaded asynchronously. This means that it is not present in the initial HTML that Nokogiri receives; it is only fetched once the browser's JavaScript engine runs the page's scripts.

Open the browser's developer tools and go to the "Network" tab. Clear out all requests, then refresh the page to see a list of every request made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".

When browsing through the requests you'll find that the data you're looking for is located at:

https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json

Since this is JSON, you don't need Nokogiri to parse it.

require 'httparty'
require 'json'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)

When executing the above you'll get the exception:

JSON::ParserError ...

This seems to be caused by a byte order mark (BOM) at the start of the body that HTTParty does not remove, most likely because the response doesn't specify a UTF-8 charset.

response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
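The "FEFF" above is the code point U+FEFF, which UTF-8 encodes as the three bytes EF BB BF. A minimal offline sketch of the same check, using a hand-built string in place of the real response body (the `sample_body` name and its JSON content are assumptions for illustration only):

```ruby
# Hand-built stand-in for response.body: a BOM followed by a JSON array.
sample_body = "\uFEFF" + '[{"Jurisdiction":"Alabama"}]'

first_char = sample_body[0]
code_point = format('%X', first_char.ord)                    # hex code point of the first character
utf8_bytes = first_char.bytes.map { |b| format('%02X', b) }  # its UTF-8 byte sequence
has_bom    = sample_body.start_with?("\uFEFF")

puts code_point            # FEFF
puts utf8_bytes.join(' ')  # EF BB BF
puts has_bom               # true
```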

To handle the BOM correctly, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.

require 'httparty'
require 'json'
require 'stringio'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145,  ...
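Once parsed, data is an ordinary array of hashes, so per-state figures can be looked up by key. A small sketch, assuming every row carries the "Jurisdiction" and "Cases Reported" keys seen in the output above; the `sample` array contains only that one row and stands in for the real response:

```ruby
# Stand-in for the parsed response, limited to the single row shown above.
sample = [
  { 'Jurisdiction' => 'Alabama', 'Range' => '10,001 to 20,000', 'Cases Reported' => 10145 }
]

# Build a lookup table: jurisdiction name => reported cases.
cases_by_state = sample.map { |row| [row['Jurisdiction'], row['Cases Reported']] }.to_h

puts cases_by_state['Alabama'] # 10145
```

With the real response, replacing `sample` with `data` should give the same kind of lookup for every jurisdiction, provided the keys are unchanged.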

If you're not yet using Ruby 2.7, you can strip the BOM manually instead; however, the former is probably the safer option:

data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
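If the same code has to run on both older and newer Ruby versions, the manual removal can be wrapped in a small helper. This is only a sketch: `parse_json_body` is a hypothetical name, it assumes the body is UTF-8 as in the response above, and it matches the BOM as `\uFEFF`, which is the same character as the `\xEF\xBB\xBF` byte sequence.

```ruby
require 'json'

# Hypothetical helper: strips a leading UTF-8 BOM (if present) and parses the rest.
def parse_json_body(body)
  text = body.dup.force_encoding('utf-8') # dup so the caller's string is left untouched
  JSON.parse(text.sub(/\A\uFEFF/, ''))
end

# Hand-built bodies standing in for response.body, with and without a BOM.
with_bom    = "\xEF\xBB\xBF".b + '[{"Jurisdiction":"Alabama"}]'.b
without_bom = '[{"Jurisdiction":"Alabama"}]'

p parse_json_body(with_bom)
p parse_json_body(without_bom)
```

Both calls should return the same array, so the helper is safe to use even if the BOM is later removed from the response.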