r web-scraping rvest

rvest: Scrape all values from a website by specific class


I am trying to scrape all location numbers, street addresses, and city/state/zip values from this website. I have tried a few different approaches without success, including extracting the values by a specific class. By digging through the source code, I found that the three values I need have the following classes: css-dhj0hp, css-8er82g, css-n2nvu7. Ideally, I'd want a data frame with store number, street address, city, state, and zip, but I can't even get the text to return. Can someone help?

library(rvest)
library(tidyverse)

url <- "https://locations.wafflehouse.com"

page <- read_html(url)

store_data <- 
  page |> 
  html_nodes("div.css-dhj0hp")

Solution

  • /../ By digging through the source code /../

    You are probably referring to the browser's element inspector, which is quite a different beast. It lets you navigate the DOM tree, which on dynamic sites is rendered, or at least heavily modified, by JavaScript and can differ substantially from the content rvest tries to parse.
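
    A quick way to check this is to count matches in the HTML that rvest actually parses. The sketch below uses a made-up, heavily simplified stand-in for a server response (the markup is an assumption, not the real page), so it runs offline:

    ```r
    library(rvest)

    # Hypothetical stand-in for the raw page source: the visible markup is an
    # empty shell that JavaScript would fill in later, and the data sits in a
    # JSON <script> element instead.
    raw_html <- '<html><body>
      <div id="__next"></div>
      <script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"locations":[]}}}</script>
    </body></html>'

    page <- read_html(raw_html)

    # Class names seen in the inspector do not exist in the parsed source
    n_divs <- length(html_elements(page, "div.css-dhj0hp"))

    # The <script> element, however, is part of the source
    n_script <- length(html_elements(page, "script#__NEXT_DATA__"))
    ```

    If the first count is zero for the real page as well, the elements are most likely created client-side and a class-based selector cannot find them.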

    In the actual page source, the web app data, including the locations, is embedded as JSON in the <script id="__NEXT_DATA__" type="application/json"> .. </script> element. We can extract it with rvest and parse it with jsonlite; the locations sit a bit deeper in the resulting nested list, at props > pageProps > locations:

    library(rvest)
    library(dplyr)
    library(tidyr)
    
    url <- "https://locations.wafflehouse.com" 
    
    page <- read_html(url)
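    store_data <- 
      page |> 
      # the data lives in this <script> element, not in rendered <div>s
      html_element("script#__NEXT_DATA__") |> 
      html_text() |> 
      # parse the JSON text into a nested list
      jsonlite::fromJSON() |>
      # drill down to the locations data frame
      purrr::pluck("props", "pageProps", "locations") |>
      # expand the nested addressLines and custom columns
      unnest(addressLines) |>
      unnest(custom) |> 
      as_tibble()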
    
    glimpse(store_data)
    #> Rows: 1,978
    #> Columns: 19
    #> $ storeCode              <chr> "100", "1000", "1001", "1002", "1003", "1004", …
    #> $ businessName           <chr> "Waffle House #100", "Waffle House #1000", "Waf…
    #> $ addressLines           <chr> "2842 PANOLA RD", "2840 E. COLLEGE AVE.", "1292…
    #> $ city                   <chr> "LITHONIA", "DECATUR", "LOUISVILLE", "NORMAN", …
    #> $ state                  <chr> "GA", "GA", "KY", "OK", "MS", "AL", "GA", "MO",…
    #> $ country                <chr> "US", "US", "US", "US", "US", "US", "US", "US",…
    #> $ operated_by            <chr> "WAFFLE HOUSE, INC", "WAFFLE HOUSE, INC", "FULL…
    #> $ online_order_link      <chr> NA, "https://order.wafflehouse.com/menu/waffle-…
    #> $ postalCode             <chr> "30058", "30030", "40243", "73072", "39520", "3…
    #> $ latitude               <dbl> 33.70471, 33.77522, 38.24359, 35.23244, 30.3132…
    #> $ longitude              <dbl> -84.16985, -84.27374, -85.51321, -97.48904, -89…
    #> $ phoneNumbers           <list> "(770) 981-1914", "(404) 294-8758", "(502) 244…
    #> $ websiteURL             <chr> "https://locations.wafflehouse.com///lithonia-g…
    #> $ businessHours          <list> <"00:00", "00:00", "00:00", "00:00", "00:00", …
    #> $ specialHours           <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>…
    #> $ formattedBusinessHours <list> "Monday - Sunday| 24 hours", "Monday - Sunday|…
    #> $ slug                   <chr> "lithonia-ga-100", "decatur-ga-1000", "louisvil…
    #> $ localPageUrl           <chr> "/lithonia-ga-100", "/decatur-ga-1000", "/louis…
    #> $ `_status`              <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…
    

    Created on 2023-08-20 with reprex v2.0.2
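
    To get from there to the data frame described in the question (store number, street address, city, state, zip), selecting and renaming a few columns should be enough. The sketch below is self-contained and runs offline: it uses a hypothetical two-record excerpt that mimics the field names shown in the glimpse() output above. The new column names, and the assumption that storeCode is the store number, are mine:

    ```r
    library(rvest)
    library(dplyr)
    library(tidyr)

    # Hypothetical excerpt mimicking the structure of the embedded JSON;
    # only the fields needed here are included.
    mock_html <- '<html><body>
    <script id="__NEXT_DATA__" type="application/json">
    {"props":{"pageProps":{"locations":[
     {"storeCode":"100","addressLines":["2842 PANOLA RD"],"city":"LITHONIA","state":"GA","postalCode":"30058"},
     {"storeCode":"1000","addressLines":["2840 E. COLLEGE AVE."],"city":"DECATUR","state":"GA","postalCode":"30030"}
    ]}}}
    </script></body></html>'

    stores <-
      read_html(mock_html) |>
      html_element("script#__NEXT_DATA__") |>
      html_text() |>
      jsonlite::fromJSON() |>
      purrr::pluck("props", "pageProps", "locations") |>
      # addressLines arrives as a list column; unnest() turns it into plain text
      unnest(addressLines) |>
      as_tibble() |>
      # keep and rename only the requested columns
      select(store_number   = storeCode,
             street_address = addressLines,
             city,
             state,
             zip            = postalCode)
    ```

    With the real store_data, the same select() call can be appended to the pipeline above. Note that unnest() yields one row per address line, so a location with several address lines would appear more than once.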