
How to Scrape NBA stats page using rvest


The page I am interested in scraping is here: https://www.nba.com/stats/teams/opponent-shots-general?GeneralRange=Pullups&SeasonType=Regular+Season

I have tried running the following code:

library(httr2)

req_url <- "https://www.nba.com/stats/teams/opponent-shots-general?GeneralRange=Pullups&SeasonType=Regular+Season"

json <- 
  request(req_url) |>
  req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36") |>
  req_headers(
    Accept = "*/*",
    Origin = "https://www.nba.com",
    Referer = "https://www.nba.com/",
  ) |> 
  req_perform() |>
  resp_body_json() 

hdr <- json$resultSets$headers

# build column names from 2-level header structure
clean_names <- 
  c(
    rep("", hdr[[1]]$columnsToSkip), 
    rep(unlist(hdr[[1]]$columnNames), each = 3)
  ) |>
  paste(unlist(hdr[[2]]$columnNames)) |>
  janitor::make_clean_names()
clean_names


# make each list in a `rowSet` a named list, 
# this allows us to use dplyr::bind_rows() to create a tibble
data <- json$resultSets$rowSet |>
  lapply(setNames, clean_names) |>
  dplyr::bind_rows()

When I run this, I get the following error:

Error in `resp_body_json()`:
! Unexpected content type "text/html".
• Expecting type "application/json" or suffix "json".

Could someone point me to what the req_url variable should be, please?


Solution

  • For context, the OP was trying to adapt a solution suggested for "Webscraping NBA.com".


    Could someone point me to what the req_url variable should be please

    For this approach you first need to find the actual API call your web browser makes when it fetches data for the table. A common approach is to extract it interactively through the Network tab of the browser's developer tools (F12 or Ctrl+Shift+I in Chromium-based browsers on Windows). Make sure DevTools is open when you load or refresh the page so that all requests are captured. To narrow the candidates from tens or hundreds down to a few, start by defining a filter and/or using search (Ctrl+F when the Network tab is active); looking for a specific phrase or value from the target table is usually much faster than scrolling through the full list of requests. [Screenshot: DevTools Network tab]
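The error in the question says the response had content type "text/html", so the URL passed to request() did not return JSON. A minimal sketch of checking a candidate request before parsing it is shown below. The helper name is_json_response() is made up for this example, and the responses are mocks built with httr2::response() so the snippet runs offline; the URL in the commented-out part is a placeholder, not a real endpoint.

```r
library(httr2)

# TRUE when the response declares a JSON body ("application/json" or a
# type ending in "json"); a rough stand-in for the check that
# resp_body_json() performs. Hypothetical helper, not part of httr2.
is_json_response <- function(resp) {
  isTRUE(grepl("json$", resp_content_type(resp)))
}

# Mock responses, so nothing is fetched from the network
html_resp <- response(headers = "Content-Type: text/html; charset=utf-8")
json_resp <- response(headers = "Content-Type: application/json")

is_json_response(html_resp)
is_json_response(json_resp)

# With a live request (placeholder; paste the URL copied from DevTools):
# resp <- request("<API request URL from the Network tab>") |> req_perform()
# if (is_json_response(resp)) json <- resp_body_json(resp)
```

If the check fails for a candidate URL, that request is most likely a page or asset request rather than the data call behind the table.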

    Just be aware that rendered values might not be in the same format as the delivered content (e.g. 0.287000 in the JSON response vs. a formatted 28.7 (%) in the rendered table). You can get the API URL through the request's context menu or the Headers pane.
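As a small illustration of that format difference, the conversion between the two forms might look like this. It uses only the 0.287 / 28.7 pair quoted above; the variable names are arbitrary.

```r
# Fraction as delivered in the JSON response (value quoted above)
raw_freq <- 0.287

# Percentage as it appears in the rendered table
shown <- sprintf("%.1f", raw_freq * 100)
shown

# Going the other way gives a candidate search term for the Network tab
rendered <- 28.7
search_term <- format(rendered / 100, nsmall = 3)
search_term
```

Searching the Network tab for the fractional form, or for a team name, is therefore more likely to hit the right request than searching for the rendered percentage.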

    Also note that different API endpoints may or may not share the same response structure, so simply plugging a different API request into JSON-parsing code from an earlier answer is not guaranteed to work; the parsing may need to be adapted as well.
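One way to guard against such differences is to inspect the parsed response before reusing parsing code. The sketch below uses two toy payloads invented for illustration, not real API output. The first has the shape the code in the question assumes (resultSets is a single object with headers and rowSet), the second has a different layout. has_nested_layout() is a hypothetical helper name.

```r
library(jsonlite)

# Toy payloads, invented for illustration; real responses will differ.
# simplifyVector = FALSE mirrors the nested lists resp_body_json() returns.
nested <- fromJSON(
  '{"resultSets": {"headers": [{"columnNames": ["A"]},
                               {"columnNames": ["X", "Y"]}],
                   "rowSet": [[1, 2]]}}',
  simplifyVector = FALSE
)
flat <- fromJSON(
  '{"resultSets": [{"headers": ["X", "Y"], "rowSet": [[1, 2]]}]}',
  simplifyVector = FALSE
)

# TRUE when `resultSets` is a single object holding `headers` and `rowSet`,
# i.e. the layout that `json$resultSets$headers` relies on
has_nested_layout <- function(json) {
  rs <- json$resultSets
  is.list(rs) && all(c("headers", "rowSet") %in% names(rs))
}

has_nested_layout(nested)
has_nested_layout(flat)

# A quick look at the top levels shows where the layouts diverge
str(flat, max.level = 3)
```

Running str() on a real response in the same way shows whether the header and row-set paths used by existing parsing code are present at all.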


    As an alternative, you might try rvest::read_html_live(). It first renders the requested page (including dynamic content, e.g. JavaScript-rendered tables) in a headless browser, lets you interact with the live browser session, and lets you work with the same JavaScript-modified DOM tree you'd see in the DevTools inspector of your desktop browser:

    library(rvest)
    
    url_ <- "https://www.nba.com/stats/teams/opponent-shots-general?GeneralRange=Pullups&SeasonType=Regular+Season" 
    read_html_live(url_) |>
      html_element("table.Crom_table__p1iZz") |> 
      html_table() 
    #> # A tibble: 31 × 19
    #>    ``    ``    ``    ``    ``    ``    `Field Goals` `Field Goals` `Field Goals`
    #>    <chr> <chr> <chr> <chr> <chr> <chr> <chr>         <chr>         <chr>        
    #>  1 TEAM  GP    G     Freq% FGM   FGA   FG%           eFG%          2FG Freq%    
    #>  2 Milw… 82    82    28.7  10.3  26.3  39.3          46.2          16.3         
    #>  3 Los … 81    81    27.0  10.0  24.9  40.2          47.7          15.6         
    #>  4 Bost… 81    81    26.9  9.4   24.6  38.0          45.5          14.4         
    #>  5 Minn… 82    82    27.4  8.7   23.7  36.8          42.5          17.1         
    #>  6 Dall… 81    81    26.2  9.2   23.6  38.9          45.9          15.2         
    #>  7 San … 81    81    25.5  9.7   23.6  41.1          48.3          15.0         
    #>  8 Hous… 82    82    26.8  9.3   23.6  39.6          47.4          14.5         
    #>  9 Orla… 82    82    27.8  9.1   23.3  39.0          46.8          15.5         
    #> 10 Broo… 79    79    26.4  9.1   23.3  39.1          46.9          14.2         
    #> # ℹ 21 more rows
    #> # ℹ 10 more variables: `Field Goals` <chr>, `Field Goals` <chr>,
    #> #   `2 Point Field Goals` <chr>, `2 Point Field Goals` <chr>,
    #> #   `2 Point Field Goals` <chr>, `2 Point Field Goals` <chr>,
    #> #   `3 Point Field Goals` <chr>, `3 Point Field Goals` <chr>,
    #> #   `3 Point Field Goals` <chr>, `3 Point Field Goals` <chr>
    

    Created on 2024-09-18 with reprex v2.1.1
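
The tibble printed above still has the table's two-level header: group names (some empty, some repeated) as column names, and the real column labels in the first row. A possible clean-up is sketched below on a toy stand-in for that result. The team names are placeholders and only four columns are reproduced; the numbers are taken from the printed output.

```r
library(tibble)

# Toy stand-in for the html_table() result: header groups as column names,
# column labels in row 1, everything stored as character
raw <- as_tibble(
  setNames(
    list(
      c("TEAM", "Team A", "Team B"),  # placeholder team names
      c("GP", "82", "81"),
      c("FG%", "39.3", "40.2"),
      c("eFG%", "46.2", "47.7")
    ),
    c("", "", "Field Goals", "Field Goals")
  ),
  .name_repair = "minimal"
)

# Combine the group name with the label found in the first row
labels <- vapply(raw, function(col) col[[1]], character(1), USE.NAMES = FALSE)
clean_names <- janitor::make_clean_names(trimws(paste(names(raw), labels)))

# Drop the label row, apply the new names, and let R guess column types
tidy <- lapply(raw, function(col) col[-1])
names(tidy) <- clean_names
tidy <- as_tibble(lapply(tidy, utils::type.convert, as.is = TRUE))
tidy
```

The same steps should carry over to the full scraped table, provided its first row holds the column labels as it does in the output above.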