Tags: r, purrr, httr, rcurl

Find each gameId by looping through a very large list of URLs and keeping those that exist


I am trying to obtain a list of all gameIds for each boxscore URL from here:

https://www.espn.com/nhl/boxscore/_/gameId/

Each URL ends with a specific gameId, e.g.

https://www.espn.com/nhl/boxscore/_/gameId/401559236

The problem is that I don't know the range of the gameIds or how many there are. For the start of the 2023-2024 season, they appear to start at 401559236 and increment by 1. But for, say, the start of the 2007-2008 season, they begin at 271009021.

I would like to get them from as far back as possible.

I used the code found here, which allows me to specify some gameIds, check if the URL exists and if it does, output the gameId.

My code here just checks five candidate gameIds from the start of the 2023-2024 season, three of which exist:

library(httr)
library(purrr)
library(RCurl)

urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/", 401559236:401559240)

safe_url_logical <- map(urls, http_error)
temp <- cbind(unlist(safe_url_logical), unlist(urls))
colnames(temp) <- c("logical","url")
temp <- as.data.frame(temp)
safe_urls <- temp %>% 
  dplyr::filter(logical=="FALSE")
dead_urls <- temp %>% 
  dplyr::filter(logical=="TRUE")

df_exist <- character()

for (i in seq_len(nrow(safe_urls))) {
  url <- as.character(safe_urls$url[i])
  # keep the URL only if RCurl confirms that it exists
  if (url.exists(url)) {
    df_exist <- c(df_exist, url)
  }
}

urls = df_exist

game_ids = sub('.*\\/', '', urls)
print(game_ids)
[1] "401559238" "401559239" "401559240"

But if I were to specify the whole range from, say, 271009021 to 401559236, that would be an extremely large number of URLs to check.

Is there a faster, more efficient alternative?

I would also like to obtain the date of each game, although I haven't been able to find that yet.


Solution

  • You could start from each team's schedule for each year. For example: https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (Ducks for 2022-23 season), and extract the gameId from the links in the "result" column.

    Here is the code for that:

    library(rvest)

    url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
    page <- read_html(url)
    
    # get the main table
    schedule <- page %>% html_elements("table") 
    
    # for each row, take the third column and find the "a" subnode;
    # from that subnode, extract the link to the game stats
    linkstogames <- schedule %>% html_elements(xpath = ".//tr//td[3]//a") %>%
                        html_attr("href")
    
    
     [1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
     [3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
     [5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
     [7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
     [9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
    [11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293"
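
The schedule-page approach above could be extended to many teams and seasons along the following lines. This is only a rough sketch, not tested against the live site: the function names (extract_game_id, get_team_games, collect_games) and the pause argument are made up for illustration, and reading the date from the first column of the schedule table is an assumption about the page layout that you should verify. It needs the rvest package installed; calls are namespace-qualified, so no library() call is required.

```r
# Pull the numeric gameId out of links such as
# "https://www.espn.com/nhl/game/_/gameId/401349148".
extract_game_id <- function(links) {
  ids <- sub(".*/gameId/([0-9]+).*", "\\1", links)
  # sub() returns its input unchanged when nothing matches, so blank those out
  ids[!grepl("/gameId/[0-9]+", links)] <- NA_character_
  ids
}

# Scrape one team-season schedule page.
# `team` is the abbreviation used in the ESPN URL (e.g. "ana").
get_team_games <- function(team, season) {
  url <- paste0("https://www.espn.com/nhl/team/schedule/_/name/", team,
                "/season/", season)
  page <- rvest::read_html(url)
  # keep only rows whose third cell ("result") links to a game
  rows <- rvest::html_elements(page, xpath = "//table//tr[td[3]//a]")
  links <- rvest::html_attr(rvest::html_element(rows, xpath = "./td[3]//a"), "href")
  # ASSUMPTION: the first cell of each row holds the game date as text
  dates <- rvest::html_text2(rvest::html_element(rows, xpath = "./td[1]"))
  out <- data.frame(team = rep(team, length(links)),
                    season = rep(season, length(links)),
                    date_text = dates,
                    game_id = extract_game_id(links),
                    stringsAsFactors = FALSE)
  out[!is.na(out$game_id), ]
}

# Visit every team/season pair. Each game shows up on two teams'
# schedules, so duplicates are dropped at the end.
collect_games <- function(teams, seasons, pause = 1) {
  grid <- expand.grid(team = teams, season = seasons, stringsAsFactors = FALSE)
  pages <- vector("list", nrow(grid))
  for (i in seq_len(nrow(grid))) {
    pages[[i]] <- tryCatch(
      get_team_games(grid$team[i], grid$season[i]),
      error = function(e) NULL  # skip pages that fail to load or parse
    )
    Sys.sleep(pause)  # pause between requests to be polite to the server
  }
  games <- do.call(rbind, pages)
  if (is.null(games)) return(NULL)
  games[!duplicated(games$game_id), ]
}

# Example call (hits the network, so it is left commented out):
# games <- collect_games(c("ana", "bos"), 2022:2024)
```

This needs one request per team per season (a few dozen pages per season) instead of one request per candidate gameId. You would still have to supply the team abbreviations yourself, and the date text on the schedule page may not include the year, so it may need to be combined with the season.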