rdplyrtidyrlag

R: Find most recent data that fulfills conditions


I have a dataframe in long format of games, their date, and whether the respective team has won. It has the following structure:

   GAME_DATE_EST GAME_ID  TEAM_ID    WIN  
   <date>        <int>    <int>      <lgl>
 1 2015-06-16    41400406 1610612739 FALSE
 2 2015-06-16    41400406 1610612744 TRUE 
 3 2015-06-14    41400405 1610612744 TRUE 
 4 2015-06-14    41400405 1610612739 FALSE
 5 2015-06-11    41400404 1610612739 FALSE
 6 2015-06-11    41400404 1610612744 TRUE 
 7 2015-06-09    41400403 1610612739 TRUE 
 8 2015-06-09    41400403 1610612744 FALSE
 9 2015-06-07    41400402 1610612744 FALSE
10 2015-06-07    41400402 1610612739 TRUE 

For each row, I need the most recent previous game that fulfills different conditions. For example, I want a new column with the GAME_ID of the most previous win for the respective team, one with the GAME_ID for the most previous loss, etc.

The only solution I found so far is with this function and using it with rowwise() but it takes ages to complete:

get_most_recent_win = function(df, team_id, date) {
  temp = subset(df, WIN & TEAM_ID == team_id & GAME_DATE_EST < date,
                select = c("GAME_ID", "GAME_DATE_EST"))
  
  if (nrow(temp) > 0) {
    return(temp[which.max(temp$GAME_DATE_EST), "GAME_ID"])
  } else {
    return(NA)
  }
}

games_longer[1:50, ] %>%
  rowwise() %>% 
  mutate(most_recent_win = get_most_recent_win(., TEAM_ID, GAME_DATE_EST)) %>% 
  select(GAME_DATE_EST, GAME_ID, most_recent_win, TEAM_ID, WIN)

Generally, what would be the tidiest and most efficient way to solve such problems?

Here is the data for you to try:

structure(list(GAME_DATE_EST = structure(c(16602, 16602, 16600, 
16600, 16597, 16597, 16595, 16595, 16593, 16593, 16590, 16590, 
16582, 16582, 16581, 16581, 16580, 16580, 16579, 16579), class = "Date"), 
    GAME_ID = c(41400406L, 41400406L, 41400405L, 41400405L, 41400404L, 
    41400404L, 41400403L, 41400403L, 41400402L, 41400402L, 41400401L, 
    41400401L, 41400315L, 41400315L, 41400304L, 41400304L, 41400314L, 
    41400314L, 41400303L, 41400303L), TEAM_ID = c(1610612739L, 
    1610612744L, 1610612744L, 1610612739L, 1610612739L, 1610612744L, 
    1610612739L, 1610612744L, 1610612744L, 1610612739L, 1610612744L, 
    1610612739L, 1610612744L, 1610612745L, 1610612739L, 1610612737L, 
    1610612745L, 1610612744L, 1610612739L, 1610612737L), WIN = c(FALSE, 
    TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, 
    TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
    FALSE)), row.names = c(NA, -20L), class = c("tbl_df", "tbl", 
"data.frame"))

Solution

  • First, arrange by GAME_DATE_EST Then, create a variable with the GAME_ID of the laged wins, filled with NAs. Then, fill, replacing the NAs with the last valid values.

    library(dplyr)
    library(tidyr)
    df |> 
        arrange(GAME_DATE_EST, TEAM_ID) |> 
        group_by(TEAM_ID) |> 
        mutate(last_win = if_else(lag(WIN), lag(GAME_ID), NA)) |> 
        fill(last_win) |> 
        ungroup()
        
    
    # A tibble: 20 × 5
       GAME_DATE_EST  GAME_ID    TEAM_ID WIN   last_win
       <date>           <int>      <int> <lgl>    <int>
     1 2015-05-24    41400303 1610612739 TRUE        NA
     2 2015-05-24    41400303 1610612737 FALSE       NA
     3 2015-05-25    41400314 1610612745 TRUE        NA
     4 2015-05-25    41400314 1610612744 FALSE       NA
     5 2015-05-26    41400304 1610612739 TRUE  41400303
     6 2015-05-26    41400304 1610612737 FALSE       NA
     7 2015-05-27    41400315 1610612744 TRUE        NA
     8 2015-05-27    41400315 1610612745 FALSE 41400314
     9 2015-06-04    41400401 1610612744 TRUE  41400315
    10 2015-06-04    41400401 1610612739 FALSE 41400304
    11 2015-06-07    41400402 1610612744 FALSE 41400401
    12 2015-06-07    41400402 1610612739 TRUE  41400304
    13 2015-06-09    41400403 1610612739 TRUE  41400402
    14 2015-06-09    41400403 1610612744 FALSE 41400401
    15 2015-06-11    41400404 1610612739 FALSE 41400403
    16 2015-06-11    41400404 1610612744 TRUE  41400401
    17 2015-06-14    41400405 1610612744 TRUE  41400404
    18 2015-06-14    41400405 1610612739 FALSE 41400403
    19 2015-06-16    41400406 1610612739 FALSE 41400403
    20 2015-06-16    41400406 1610612744 TRUE  41400405