I have a dataframe in long format of games, their date, and whether the respective team has won. It has the following structure:
GAME_DATE_EST GAME_ID TEAM_ID WIN
<date> <int> <int> <lgl>
1 2015-06-16 41400406 1610612739 FALSE
2 2015-06-16 41400406 1610612744 TRUE
3 2015-06-14 41400405 1610612744 TRUE
4 2015-06-14 41400405 1610612739 FALSE
5 2015-06-11 41400404 1610612739 FALSE
6 2015-06-11 41400404 1610612744 TRUE
7 2015-06-09 41400403 1610612739 TRUE
8 2015-06-09 41400403 1610612744 FALSE
9 2015-06-07 41400402 1610612744 FALSE
10 2015-06-07 41400402 1610612739 TRUE
For each row, I need the most recent previous game that fulfills different conditions. For example, I want a new column with the GAME_ID of the most previous win for the respective team, one with the GAME_ID for the most previous loss, etc.
The only solution I found so far is with this function and using it with rowwise()
but it takes ages to complete:
get_most_recent_win = function(df, team_id, date) {
temp = subset(df, WIN & TEAM_ID == team_id & GAME_DATE_EST < date,
select = c("GAME_ID", "GAME_DATE_EST"))
if (nrow(temp) > 0) {
return(temp[which.max(temp$GAME_DATE_EST), "GAME_ID"])
} else {
return(NA)
}
}
games_longer[1:50, ] %>%
rowwise() %>%
mutate(most_recent_win = get_most_recent_win(., TEAM_ID, GAME_DATE_EST)) %>%
select(GAME_DATE_EST, GAME_ID, most_recent_win, TEAM_ID, WIN)
Generally, what would be the tidiest and most efficient way to solve such problems?
Here is the data for you to try:
structure(list(GAME_DATE_EST = structure(c(16602, 16602, 16600,
16600, 16597, 16597, 16595, 16595, 16593, 16593, 16590, 16590,
16582, 16582, 16581, 16581, 16580, 16580, 16579, 16579), class = "Date"),
GAME_ID = c(41400406L, 41400406L, 41400405L, 41400405L, 41400404L,
41400404L, 41400403L, 41400403L, 41400402L, 41400402L, 41400401L,
41400401L, 41400315L, 41400315L, 41400304L, 41400304L, 41400314L,
41400314L, 41400303L, 41400303L), TEAM_ID = c(1610612739L,
1610612744L, 1610612744L, 1610612739L, 1610612739L, 1610612744L,
1610612739L, 1610612744L, 1610612744L, 1610612739L, 1610612744L,
1610612739L, 1610612744L, 1610612745L, 1610612739L, 1610612737L,
1610612745L, 1610612744L, 1610612739L, 1610612737L), WIN = c(FALSE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE)), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
First, arrange by GAME_DATE_EST
Then, create a variable with the GAME_ID of the lag
ed wins, filled with NAs. Then, fill
, replacing the NAs with the last valid values.
library(dplyr)
library(tidyr)
df |>
arrange(GAME_DATE_EST, TEAM_ID) |>
group_by(TEAM_ID) |>
mutate(last_win = if_else(lag(WIN), lag(GAME_ID), NA)) |>
fill(last_win) |>
ungroup()
# A tibble: 20 × 5
GAME_DATE_EST GAME_ID TEAM_ID WIN last_win
<date> <int> <int> <lgl> <int>
1 2015-05-24 41400303 1610612739 TRUE NA
2 2015-05-24 41400303 1610612737 FALSE NA
3 2015-05-25 41400314 1610612745 TRUE NA
4 2015-05-25 41400314 1610612744 FALSE NA
5 2015-05-26 41400304 1610612739 TRUE 41400303
6 2015-05-26 41400304 1610612737 FALSE NA
7 2015-05-27 41400315 1610612744 TRUE NA
8 2015-05-27 41400315 1610612745 FALSE 41400314
9 2015-06-04 41400401 1610612744 TRUE 41400315
10 2015-06-04 41400401 1610612739 FALSE 41400304
11 2015-06-07 41400402 1610612744 FALSE 41400401
12 2015-06-07 41400402 1610612739 TRUE 41400304
13 2015-06-09 41400403 1610612739 TRUE 41400402
14 2015-06-09 41400403 1610612744 FALSE 41400401
15 2015-06-11 41400404 1610612739 FALSE 41400403
16 2015-06-11 41400404 1610612744 TRUE 41400401
17 2015-06-14 41400405 1610612744 TRUE 41400404
18 2015-06-14 41400405 1610612739 FALSE 41400403
19 2015-06-16 41400406 1610612739 FALSE 41400403
20 2015-06-16 41400406 1610612744 TRUE 41400405