r, for-loop, web-scraping

In R, when scraping, I am getting an error due to it identifying an extra column, then not identifying it


I am trying to scrape store information from the store-specific URLs of a particular chain, using R.

I started by testing my scrape on a single URL and got this error:

Error in mutate(., names = c("Name", "Address", "Phone")) : Caused by error: ! names must be size 4 or 1, not 3.

So I attempted to fix it by adding a fourth name, "drop", and then removing that column, which worked.

When I run the for loop, however, the error flips: whether I supply 3 or 4 names, it says it needs the other number. Please see the code below.

Here are the steps I have taken and what the output is:

Initial Test

library(rvest)
library(dplyr)
library(tidyr)

trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"


urls_1 <- trial_urls_1 %>% 
  read_html() %>% 
  html_elements(".store-details :nth-child(2)") %>% 
  html_text() %>% 
  tibble() %>% 
  rename(info = ".") %>% 
  unnest(col = info) %>% 
  mutate(names = c("Name", "Address", "Phone")) %>% #I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
  pivot_wider(names_from = names,
              values_from = info )

urls_1

Error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)
Error in mutate(., names = c("Name", "Address", "Phone")) : 
Caused by error:
! `names` must be size 4 or 1, not 3.

Test 2 (that works!):

urls_2 <- trial_urls_1 %>% 
  read_html() %>% 
  html_elements(".store-details :nth-child(2)") %>% 
  html_text() %>% 
  tibble() %>% 
  rename(info = ".") %>% 
  unnest(col = info) %>% 
  mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
  pivot_wider(names_from = names,
              values_from = info ) %>% 
  select(-drop)

urls_2

For loop with the 4 column names

#blank list
store_info_2 <- list()

#loop over all_store_urls
for (i in 1:length(all_store_urls)){
  store_info_2[[i]] <- all_store_urls[i] %>%
    read_html() %>% 
    html_elements(".store-details :nth-child(2)") %>% 
    html_text() %>% 
    tibble() %>% 
    rename(info = ".") %>% 
    unnest(col = info) %>% 
    mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
    pivot_wider(names_from = names,
              values_from = info )
}

Which yields this error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone", "drop")`.
Caused by error:
! `names` must be size 3 or 1, not 4.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)

So I reverted to 3 names, and the opposite error came back:

store_info_1 <- list()

#loop over all_store_urls
for (i in 1:length(all_store_urls)){
  store_info_1[[i]] <- all_store_urls[i] %>%
    read_html() %>% 
    html_elements(".store-details :nth-child(2)") %>% 
    html_text() %>% #extract text
    tibble() %>%
    rename(info = ".") %>% 
    unnest(col = info) %>% 
    mutate(names = c("Name", "Address", "Phone")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
    pivot_wider(names_from = names,
              values_from = info )
}

Error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)

What am I missing?


Solution

  • For the included example page, your CSS selector (.store-details :nth-child(2)) returns 4 elements:

    library(rvest)
    
    trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
    html  <- read_html(trial_urls_1)
    
    html |> 
      html_elements(".store-details :nth-child(2)")
    #> {xml_nodeset (4)}
    #> [1] <h1>Lakewood</h1>
    #> [2] <a href="https://www.google.com/maps/search/?api=1&amp;query=39.716896%2C ...
    #> [3] <a href="tel:303-957-9276" role="button" tabindex="0" aria-hidden="false" ...
    #> [4] <br>
    

    But the selector is quite loose and can easily return a different number of elements for other stores (e.g. 2 for https://www.sprouts.com/store/co/littleton/belleview-ave/ and 3 for https://www.sprouts.com/store/co/denver/central-park-blvd/ ). That is why the loop fails whichever number of names you supply: mutate() needs the new column to have exactly as many values as the tibble has rows (or a single value), and the row count changes from page to page. Expecting a 4x1 tibble so that you can add a column with 4 items is not a robust approach.
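
To see why the count varies, here is a small offline sketch. The two HTML snippets are made-up stand-ins for store pages (assumed markup, not the real Sprouts HTML), built with rvest's minimal_html(). The selector matches every descendant of .store-details that is the second child of its own parent, so any extra nesting changes the number of matches:

```r
library(rvest)
library(dplyr)

sel <- ".store-details :nth-child(2)"

# Hypothetical page A: two elements are "second children" (h1 and the phone link)
page_a <- minimal_html('
  <div class="store-details">
    <h2>Store</h2>
    <h1>Lakewood</h1>
    <div class="contact">
      <a class="store-address">98 Wadsworth Blvd.</a>
      <a class="store-phone">303-957-9276</a>
      <br>
    </div>
  </div>')

# Hypothetical page B: the address link now wraps two spans,
# so the second span is matched as well
page_b <- minimal_html('
  <div class="store-details">
    <h2>Store</h2>
    <h1>Denver</h1>
    <div class="contact">
      <a class="store-address"><span>Central Park</span><span>(address)</span></a>
      <a class="store-phone">(phone)</a>
      <br>
    </div>
  </div>')

length(html_elements(page_a, sel))  # 2
length(html_elements(page_b, sel))  # 3

# A fixed-length names vector cannot fit both pages:
# 4 names against the 3 rows from page B is a size mismatch in mutate()
rows_b <- tibble(info = html_text(html_elements(page_b, sel)))
failed <- tryCatch({
  mutate(rows_b, names = c("Name", "Address", "Phone", "drop"))
  FALSE
}, error = function(e) TRUE)
failed  # TRUE
```

Whatever length you pick for the names vector, some page in the loop will return a different number of rows and trigger the same size error.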

    I'd rather aim for an explicit selector for each extracted bit; this way the output shape is always fixed. html_element() (singular) always returns exactly one result per selector, so if a selector has no match on some page, that item / column is still present, just filled with NA, as shown here with .store-details .not-present:

    list(
      Name    = html_element(html, ".store-details h1"),
      Address = html_element(html, ".store-details .store-address"),
      Phone   = html_element(html, ".store-details .store-phone"),
      Missing = html_element(html, ".store-details .not-present")
    ) |> 
      lapply(html_text2) |> 
      tibble::as_tibble()
    #> # A tibble: 1 × 4
    #>   Name     Address                                                 Phone Missing
    #>   <chr>    <chr>                                                   <chr> <chr>  
    #> 1 Lakewood "Fairfield Commons\n98 Wadsworth Blvd. Lakewood, CO 80… 303-… <NA>
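
Applied to the loop from the question, the same idea could look roughly like the sketch below. The helper name scrape_store() and the two inline pages are assumptions for illustration (the pages are made-up HTML so the example runs offline); the selectors are the ones used above.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: turn one parsed page into a one-row tibble.
# html_element() returns a missing node when nothing matches,
# and html_text2() turns that into NA, so the shape stays 1 x 3.
scrape_store <- function(page) {
  tibble(
    Name    = html_text2(html_element(page, ".store-details h1")),
    Address = html_text2(html_element(page, ".store-details .store-address")),
    Phone   = html_text2(html_element(page, ".store-details .store-phone"))
  )
}

# Made-up stand-ins for two store pages; the second one has no phone element
pages <- list(
  minimal_html('<div class="store-details"><h1>Lakewood</h1>
    <a class="store-address">98 Wadsworth Blvd.</a>
    <a class="store-phone">303-957-9276</a></div>'),
  minimal_html('<div class="store-details"><h1>Littleton</h1>
    <a class="store-address">(hypothetical address)</a></div>')
)

# One row per page, stacked into a single tibble
store_info <- bind_rows(lapply(pages, scrape_store))
store_info
```

With real URLs, something like `bind_rows(lapply(all_store_urls, \(u) scrape_store(read_html(u))))` should give one row per store, with NA wherever a page lacks one of the elements.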