Tags: r, purrr, rbind

Convert nested lists to a dataframe in R


My goal is to convert the nested lists produced by my code into a dataframe. I have the following code, which extracts some data from several URLs in a loop and stores it in lists.

library(rvest)
library(XML)
library(purrr)
library(stringr)
library(dplyr)

# declare variables
month = c('07','09')
year = c('2022','2023')
day = c('040','050')

# initialize the empty list
final = list()

# perform the loop
for (i in year) {
  for (j in month) {
    for (k in day) {

      skip_to_next <- FALSE

      url <- paste0('https://www.baseball-reference.com/boxes/ARI/ARI', i, j, k, '.shtml')

      Sys.sleep(5)
      game_path <- tryCatch(
        url |>
          read_html() |>
          html_nodes(xpath = '//div[contains(@id, "batting")]') |>
          map(\(x) x |>
                as.character() |>
                str_remove_all("<!--|-->") |>
                read_html() |>
                html_table()) |>
          unlist(recursive = FALSE),
        error = function(e) {skip_to_next <<- TRUE}
      )

      if (skip_to_next) {next}

      url <- read_html(url)

      list_url <- url %>%
        html_nodes(xpath = "//td/a") %>%
        html_text()

      List_2_letters <- as.list(list_url[nchar(list_url) > 5])

      game_path <- mapply(cbind, game_path, "Date" = paste(gsub('.{1}$', '', k), j, i, sep = '-'), SIMPLIFY = FALSE)

      game_path <- Map(cbind, game_path, "Team" = List_2_letters)

      final[[i]][[j]][[k]] <- game_path

    }
  }
}

I get a bunch of lists that look like the following:

[screenshot of the nested list output]

What I am trying to do is combine all of the lists that contain the data.frame values into a single dataframe.

I tried all of these:

final_2 = map_dfr(final, ~ bind_rows(.x))
final_2 <- as.data.frame(do.call(cbind, final))
final_2 <- do.call("rbind", final)

But they all just produce two lists side by side. I am quite stuck on how this can be solved.


Solution

  • You could achieve your desired result more easily by not creating a nested list. I refactored your code by first putting the main scraping code in a function for easier debugging and testing. Inside this function I already bind the team tables into one dataframe using dplyr::bind_rows(). The function should also be a bit more efficient, as it avoids reading the HTML twice as your code does.

    For the loop part I switched to purrr::map(). To this end, create a dataframe containing the dates and the respective URLs. This way you can loop directly over the URLs without the need for a nested for loop. As a result you get a list of dataframes, which you can finally bind together by rows.

    Finally, note that I dropped tryCatch() and instead use purrr::safely() for the error handling.

    library(rvest)
    library(purrr)
    library(stringr)
    library(dplyr)
    
    make_url <- function(year, month, day) {
      paste0(
        "https://www.baseball-reference.com/boxes/ARI/ARI",
        year, month, day, ".shtml"
      )
    }
    
    scrape_table <- function(url) {
      html <- read_html(url)
    
      nodes <- html |>
        html_elements(xpath = '//div[starts-with(@id, "all_") and contains(@id, "batting")]')
    
      teams <- html |>
        html_elements(xpath = "//td/a") |>
        html_text()
    
      nodes |>
        purrr::set_names(teams) |>
        purrr::map(\(x) {
          x |>
            as.character() |>
            str_remove_all("<!--|-->") |>
            read_html() |>
            html_table()
        }) |>
        unlist(recursive = FALSE) |>
        dplyr::bind_rows(.id = "Team")
    }
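
    # Quick sanity check of the helper on a single box score, e.g.
    # scrape_table(make_url("2022", "07", "040"))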
    
    
    # declare variables
    month <- c("07", "09")
    year <- c("2022")
    day <- c("040")
    
    dates <- expand.grid(
      year = year, month = month, day = day
    )
    
    urls <- dates |>
      mutate(
        url = make_url(year, month, day),
        date = paste(year, month, day, sep = "-"),
        .keep = "unused"
      )
    
    safe_scrape_table <- purrr::safely(scrape_table)
    
    final <- purrr::map(urls$url, \(url) {
      Sys.sleep(5)
      safe_scrape_table(url)
    }) |>
      set_names(urls$date)
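
    # Optional: purrr::safely() wraps each call in a list with a $result and an
    # $error element, so you can check which dates failed before dropping the
    # errors (illustration only; `failed` is just a helper name):
    failed <- final |>
      purrr::transpose() |>
      pluck("error") |>
      map_lgl(\(e) !is.null(e))
    names(failed)[failed]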
    
    final <- final |>
      purrr::transpose() |>
      pluck("result") |>
      bind_rows(.id = "Date")
    
    head(final)
    #> # A tibble: 6 × 26
    #>   Date       Team  Batting    AB     R     H   RBI    BB    SO    PA    BA   OBP
    #>   <chr>      <chr> <chr>   <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
    #> 1 2022-07-0… San … Austin…     2     0     0     0     1     1     3 0.243 0.367
    #> 2 2022-07-0… San … Mike Y…     2     0     0     0     0     1     2 0.236 0.338
    #> 3 2022-07-0… San … Wilmer…     3     1     0     0     0     2     4 0.242 0.331
    #> 4 2022-07-0… San … Darin …     2     1     0     0     1     2     4 0.22  0.335
    #> 5 2022-07-0… San … Evan L…     3     1     1     0     1     0     4 0.248 0.333
    #> 6 2022-07-0… San … LaMont…     4     0     1     2     0     0     4 0.22  0.313
    #> # ℹ 14 more variables: SLG <dbl>, OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>,
    #> #   aLI <dbl>, `WPA+` <dbl>, `WPA-` <chr>, cWPA <chr>, acLI <dbl>, RE24 <dbl>,
    #> #   PO <int>, A <int>, Details <chr>
    

    EDIT The issue is that the data types of the columns differ from date to date, e.g. for 2022-07-04 the WPA- column is read as character because the values include a % symbol, whereas for 2023-07-04 it contains only numbers and is hence read as numeric. To fix that we need to do some data cleaning before we can bind the tables. The following code splits the steps into parts. First, extract the tables from the result. Then loop over the list of dataframes and convert character columns which should be numeric into numerics, using e.g. readr::parse_number(). After doing so, the bind should in principle work. I tested for the years 2022 and 2023, but of course it is possible that other issues arise. A safer option would be to convert all columns to character, then bind, do the data cleaning afterwards and finally convert the columns to numerics; a sketch of this variant is included at the end.

    final_result <- final |>
      purrr::transpose() |>
      pluck("result")
    
    # Data Cleaning
    final <- map(
      final_result,
      \(x) {
        x |> 
          mutate(across(!c(Team, Batting, Details) & where(is.character), readr::parse_number))
      }
    )
    
    # Bind
    final <- final |>
      bind_rows(.id = "Date")
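
    If you want to go the safer route mentioned above, a minimal sketch could look like the one below (assuming that, apart from Date, Team, Batting and Details, the remaining character columns should all become numerics; the variable names are only illustrative):

    # Safer variant: coerce every column to character first, bind all tables,
    # and only then parse the numeric columns.
    final_chr <- final_result |>
      map(\(x) mutate(x, across(everything(), as.character))) |>
      bind_rows(.id = "Date")

    final_clean <- final_chr |>
      mutate(across(
        !c(Date, Team, Batting, Details),
        readr::parse_number
      ))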