web-scrapingwebhtml-tablenarvest

Web scrapping in R


I am trying to web scrap baseball information of Norfolk Tides for the 2022 season. When I run my code to get the table, it only displays the column names and the first observation with NAs. Please, can someone help me identify the issue and how to resolve it? Thank you in advance.

url = "https://www.baseballamerica.com/teams/3181/norfolk-tides/schedule/?year=2022"

data <- url %>% 
        read_html() %>%
        html_element(".scrolling-table") %>%
        html_table()
data

enter image description here

I have tried html_text to get each column information as a vector and then combine them all in a data.frame but it is not working. Please, I need your help.


Solution

  • The table content is not included in the page source but pulled from statsapi.mlb.com API as JSON, naive yet often effective and fastest way to extract those API calls is though a search in network tab of browsers dev tools (screenshot). This also makes this more of a rectangling problem, turning deeply nested list into flat, tabular form (assuming you don't just want to flatten the whole dataset) . For a somewhat similar content as in rendered table, one might try:

    library(jsonlite)
    library(dplyr)
    library(tidyr)
    
    schedule <- fromJSON("https://statsapi.mlb.com/api/v1/schedule?startDate=1/01/2022&endDate=12/31/2022&gameTypes=R&sportId=11&teamId=568&hydrate=decisions", simplifyVector = F)
    
    schedule_tbl <- tibble(schedule = schedule$dates) %>% 
      unnest_wider(schedule) %>% 
      select(games) %>% 
      unnest_longer(games) %>% 
      hoist(games, 
            "gameDate",
            home = list("teams", "home"),
            away = list("teams", "away"),
            win  = list("decisions", "winner", "fullName"),
            loss = list("decisions", "loser", "fullName"),
            save = list("decisions", "save", "fullName")) %>% 
      unnest_wider(c(home, away), names_sep = "_") %>% 
      hoist(home_team, home = "name") %>% 
      hoist(away_team, away = "name") %>% 
      select(-matches("Record|Squad|Number|decisions|games|team")) 
    

    Parsed API response (schedule) is a deeply nested list, all the relevant data is in schedule$dates and turning it into a tibble creates a table with column of lists and allows to apply unnest_*() and hoist() operations. Every unnest_longer() increases row numbers while keeping column numbers the same while unnest_longer() keeps the number of rows but spreads one level of list elements across columns. hoist() allows to select just few elements from the list. To check that same JSON online, one can use jsoneditoronline.org (link includes reference to the API call)

    Results :

    schedule_tbl
    #> # A tibble: 157 × 10
    #>    gameDate        home_…¹ home  home_…² away_…³ away  away_…⁴ win   loss  save 
    #>    <chr>             <int> <chr> <lgl>     <int> <chr> <lgl>   <chr> <chr> <chr>
    #>  1 2022-04-05T22:…       1 Norf… FALSE         3 Char… TRUE    Will… Ofel… Yack…
    #>  2 2022-04-06T22:…       6 Norf… TRUE          5 Char… FALSE   Cole… Andr… <NA> 
    #>  3 2022-04-07T16:…       7 Norf… FALSE        13 Char… TRUE    Zach… Davi… <NA> 
    #>  4 2022-04-08T23:…      12 Norf… TRUE          1 Char… FALSE   Ryan… Emil… <NA> 
    #>  5 2022-04-09T22:…       7 Norf… TRUE          6 Char… FALSE   Diog… Ande… Marc…
    #>  6 2022-04-10T17:…       4 Norf… FALSE         5 Char… TRUE    Andr… Kevi… Will…
    #>  7 2022-04-12T22:…       1 Scra… FALSE         3 Norf… TRUE    Ryan… Matt… Nick…
    #>  8 2022-04-13T22:…       1 Scra… FALSE         5 Norf… TRUE    Cody… Hayd… <NA> 
    #>  9 2022-04-14T22:…       5 Scra… FALSE        12 Norf… TRUE    Gray… Deiv… <NA> 
    #> 10 2022-04-15T22:…       7 Scra… FALSE        14 Norf… TRUE    Ofel… Luis… <NA> 
    #> # … with 147 more rows, and abbreviated variable names ¹​home_score,
    #> #   ²​home_isWinner, ³​away_score, ⁴​away_isWinner
    
    
    glimpse(schedule_tbl)
    #> Rows: 157
    #> Columns: 10
    #> $ gameDate      <chr> "2022-04-05T22:35:00Z", "2022-04-06T22:35:00Z", "2022-04…
    #> $ home_score    <int> 1, 6, 7, 12, 7, 4, 1, 1, 5, 7, NA, 1, 1, 7, 3, 6, 5, 5, …
    #> $ home          <chr> "Norfolk Tides", "Norfolk Tides", "Norfolk Tides", "Norf…
    #> $ home_isWinner <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
    #> $ away_score    <int> 3, 5, 13, 1, 6, 5, 3, 5, 12, 14, NA, 5, 0, 5, 0, 11, 4, …
    #> $ away          <chr> "Charlotte Knights", "Charlotte Knights", "Charlotte Kni…
    #> $ away_isWinner <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE,…
    #> $ win           <chr> "Will Carter", "Cole Uvila", "Zach Muckenhirn", "Ryan Co…
    #> $ loss          <chr> "Ofelky Peralta", "Andrew Perez", "David Lebron", "Emili…
    #> $ save          <chr> "Yacksel Rios", NA, NA, NA, "Marcos Diplan", "Will Carte…
    

    Created on 2023-03-05 with reprex v2.0.2

    While jsonlite can simplify and flatten, in this particular case defining all rectangling steps manually seems to work better. Note that gameDate is in UTC, not local time. And there's this copyright notice in every API response:

    Copyright 2023 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt",