r web-scraping

Scraping data from fiba.basketball game overview page in R


FIBA used to have a page from which play-by-play and box score data were much easier to grab (example: scrape fiba stats box score).

They now have a new game page that looks like this: https://www.fiba.basketball/en/events/fiba-americup-2025-qualifiers/games/120186-DOM-MEX#shotChart

I used dev tools to examine the elements and the XHR requests to see whether there was an obvious place where the underlying data was being held, but I couldn't find one.

There is one area nested in a script node that seems to have what I would expect to be the underlying play-by-play data. xpath = /html/body/script[47]/text()

If I pull that specific XPath, I can't parse what comes back: it contains so many extra backslashes that the structure seems to be ruined.

library(rvest)

page <- 'https://www.fiba.basketball/en/events/fiba-americup-2025-qualifiers/games/120186-DOM-MEX#shotChart'

my_session <- session(url = page)

my_session %>% html_nodes(xpath = '/html/body/script[47]/text()')

Hoping to get guidance on one of two things.

  1. Is there a more obvious place from which the play-by-play data (or shot coordinates) can be extracted?
  2. Is there a way to convert what is returned by the XPath above into a tabular format?

Solution

  • Yes, it looks like the information is stored in that location as a JavaScript string that holds JSON nested inside JSON.
    Extracting it is a matter of reading the string, removing the extra characters at the beginning and end, and converting from JSON. Because of how the data is structured, it took some trial and error and two decoding steps to get the desired information and store it in the "game" variable.

    library(rvest)
    page <- 'https://www.fiba.basketball/en/events/fiba-americup-2025-qualifiers/games/120186-DOM-MEX#shotChart'
    my_session <- session(url = page)
    
    # script[47] is a position-based selector and may need adjusting if the page layout changes
    text <- my_session %>% html_elements(xpath = '/html/body/script[47]/text()') %>% html_text()
    
    # drop the leading 19 characters and the final character, then parse the remaining JSON
    temp <- substr(text, 20, nchar(text) - 1) %>% jsonlite::fromJSON()
    # the second item is itself a JSON string behind a 3-character prefix; strip it and parse again
    data <- jsonlite::fromJSON(substr(temp[[2]], 4, nchar(temp[[2]])))
    # the desired information is stored in the fourth list item
    game <- data[[4]]
    
    game$playersTeamA
    game$playersTeamB
    game$sidebar
    game$minimal
    game$gameData
    game$status
    game$teamColors
    game$game
    game$playByPlay
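
The fixed offsets above (script number 47, 19 leading characters, a 3-character prefix) were found by trial and error, so they could break if the page changes. As a rough sketch of a less position-dependent variant, the example below runs the same two decoding steps on a small made-up page rather than the live site. The `wrapper.push(...)` call, the `1e:` prefix and the contents of `playByPlay` are all invented for illustration; only the nesting (JSON inside a string inside a JSON array inside a JavaScript call) mirrors what is described above.

```r
library(rvest)     # read_html, html_elements, html_text, %>%
library(jsonlite)  # fromJSON, toJSON

# Toy payload (all values are made up): a JSON object behind a short prefix,
# stored as a string inside a JSON array.
inner   <- '{"playByPlay":[{"period":1,"action":"2pt"},{"period":1,"action":"foul"}]}'
payload <- toJSON(list(1, paste0("1e:", inner)), auto_unbox = TRUE)

# Toy page: the array is the argument of a hypothetical JavaScript call.
html <- paste0(
  "<html><body>",
  "<script>var unrelated = 1;</script>",
  "<script>wrapper.push(", payload, ")</script>",
  "</body></html>"
)
toy_page <- read_html(html)

# Pick the script by content instead of by position.
scripts <- toy_page %>% html_elements("script") %>% html_text()
target  <- scripts[grepl("playByPlay", scripts, fixed = TRUE)][1]

# Step 1: keep everything from the first "[" to the last "]" and parse it.
start      <- regexpr("[", target, fixed = TRUE)[1]
end        <- max(gregexpr("]", target, fixed = TRUE)[[1]])
json_outer <- substr(target, start, end)
outer      <- fromJSON(json_outer)

# Step 2: the second item is still a JSON string; drop the prefix up to and
# including the first colon, then parse again.
inner_json <- sub("^[^:]*:", "", outer[[2]])
game_toy   <- fromJSON(inner_json)

# fromJSON simplifies an array of records into a data frame.
pbp <- game_toy$playByPlay
print(pbp)
```

On the real page the search string, the prefix format and the position of the game data inside the decoded list would all need to be checked against what the page actually returns. If `game$playByPlay` comes back as an array of records, `fromJSON` may already have simplified it into a data frame, which would address the tabular-format part of the question.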