I am trying to scrape store information from the individual store-specific URLs of a particular chain, using R.
I started by testing my scrape on a single URL and got this error:
Error in mutate(., names = c("Name", "Address", "Phone")) : Caused by error: ! `names` must be size 4 or 1, not 3.
So I attempted to fix it by adding a fourth "drop" name and then removing that column, which worked.
When I run the for loop over all store URLs, however, the error flips: with 4 names it says it needs 3, and with 3 names it says it needs 4. Please see the code below.
Here are the steps I have taken and the output of each:
Initial Test
library(rvest)
library(dplyr)
library(tidyr)
trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
urls_1 <- trial_urls_1 %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>%
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone")) %>% #I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info )
urls_1
Error:
Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
1. ... %>% pivot_wider(names_from = names, values_from = info)
10. dplyr:::dplyr_internal_error(...)
Error in mutate(., names = c("Name", "Address", "Phone")) :
Caused by error:
! `names` must be size 4 or 1, not 3.
Test 2 (that works!):
urls_2 <- trial_urls_1 %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>%
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info ) %>%
select(-drop)
urls_2
For loop with the 4 column names
#blank list
store_info_2 <- list()
#loop over all_store_urls
for (i in seq_along(all_store_urls)){
store_info_2[[i]] <- all_store_urls[i] %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>%
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info )
}
Which yields this error:
Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone", "drop")`.
Caused by error:
! `names` must be size 3 or 1, not 4.
Backtrace:
1. ... %>% pivot_wider(names_from = names, values_from = info)
10. dplyr:::dplyr_internal_error(...)
So I reverted to 3 names, and the opposite error appeared:
store_info_1 <- list()
#loop over all_store_urls
for (i in seq_along(all_store_urls)){
store_info_1[[i]] <- all_store_urls[i] %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>% #extract text
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone")) %>% #with 3 names the error says the scrape is yielding 4 rows. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info )
}
Error:
Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
1. ... %>% pivot_wider(names_from = names, values_from = info)
10. dplyr:::dplyr_internal_error(...)
What am I missing?
For the included example page, your CSS selector (.store-details :nth-child(2)) returns 4 elements:
library(rvest)
trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
html <- read_html(trial_urls_1)
html |>
html_elements(".store-details :nth-child(2)")
#> {xml_nodeset (4)}
#> [1] <h1>Lakewood</h1>
#> [2] <a href="https://www.google.com/maps/search/?api=1&query=39.716896%2C ...
#> [3] <a href="tel:303-957-9276" role="button" tabindex="0" aria-hidden="false" ...
#> [4] <br>
But the selector is quite loose and can easily return a different number of elements for other stores you are testing with (e.g. 2 for https://www.sprouts.com/store/co/littleton/belleview-ave/ and 3 for https://www.sprouts.com/store/co/denver/central-park-blvd/ ). An approach that expects a 4x1 tibble, so that a column with exactly 4 items can be added, is not robust.
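To see why the match count varies, here is a small offline sketch. The two pages are made-up stand-ins built with minimal_html(), not the real site markup; the point is only that `:nth-child(2)` after a descendant combinator matches every element that is the second child of its own parent, at any depth inside .store-details, so the count depends on how each page happens to be nested.

```r
library(rvest)

# Made-up page A: two wrapper <div>s, each holding two children
page_a <- minimal_html('
  <div class="store-details">
    <div><span>Store</span><h1>Lakewood</h1></div>
    <div><span>Address</span><a class="store-address">98 Wadsworth Blvd.</a></div>
  </div>')

# Made-up page B: the same kind of data, but without wrapper <div>s
page_b <- minimal_html('
  <div class="store-details">
    <h1>Littleton</h1>
    <a class="store-address">Belleview Ave</a>
  </div>')

# Same selector, different number of matches per page
n_a <- length(html_elements(page_a, ".store-details :nth-child(2)"))
n_b <- length(html_elements(page_b, ".store-details :nth-child(2)"))
c(page_a = n_a, page_b = n_b)
```

On page A the selector picks up the second wrapper div as well as the h1 and the link nested inside the wrappers; on page B it picks up only the link. A fixed-length vector of names passed to mutate() cannot fit both.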
I'd rather aim for an explicit selector for each extracted item; this way the output shape is always fixed. If some selector has no match on a page, that item / column is still present, just filled with NA, as shown here with .store-details .not-present:
list(
Name = html_element(html, ".store-details h1"),
Address = html_element(html, ".store-details .store-address"),
Phone = html_element(html, ".store-details .store-phone"),
Missing = html_element(html, ".store-details .not-present")
) |>
lapply(html_text2) |>
tibble::as_tibble()
#> # A tibble: 1 × 4
#> Name Address Phone Missing
#> <chr> <chr> <chr> <chr>
#> 1 Lakewood "Fairfield Commons\n98 Wadsworth Blvd. Lakewood, CO 80… 303-… <NA>
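The same idea can be wrapped in a function and applied to every page, which replaces the for loop from the question. This is a minimal sketch: scrape_store() is a hypothetical helper name, and the two pages are made-up stand-ins built with minimal_html() so the example runs offline. The second page deliberately lacks an address and a phone number.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: always returns a 1-row tibble with the same 3 columns
scrape_store <- function(html) {
  list(
    Name    = html_element(html, ".store-details h1"),
    Address = html_element(html, ".store-details .store-address"),
    Phone   = html_element(html, ".store-details .store-phone")
  ) |>
    lapply(html_text2) |>      # a selector without a match becomes NA
    tibble::as_tibble()
}

# Made-up stand-in pages (not the real site markup)
pages <- list(
  minimal_html('<div class="store-details"><h1>Lakewood</h1>
                <a class="store-address">98 Wadsworth Blvd.</a>
                <a class="store-phone">303-957-9276</a></div>'),
  minimal_html('<div class="store-details"><h1>Littleton</h1></div>')
)

# One row per page; the column set never changes, so binding is safe
store_info <- pages |>
  lapply(scrape_store) |>
  bind_rows()

store_info
```

With real URLs, the list of pages would presumably come from something like `lapply(all_store_urls, read_html)`, assuming all_store_urls is the character vector of store URLs from the question.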