I have a dataset of customer transactions, and need to create a new column (Vintage), based on a fairly complicated set of 'temporal' rules:
in a nutshell, the requirement is to tag each entry (row) by the Date the Name came on book (i.e. when were they a “NEW” 'Movement'). In practice, there are several situations to consider:
if the name first appeared in the dataset with 'EXISTING' tag, then Vintage = 'Back Book'
if for the same name there is a 'NEW' tag at a later reporting date - Vintage becomes the Date of the 'NEW' entry
for 'EXISTING' entries following a 'NEW' tag - Vintage = the most recent 'NEW' entry Date
if there is a new 'NEW' entry for the name, the Vintage value becomes the Date of the new 'NEW' entry
This might sound quite convoluted, so I have included the image below to illustrate the desired outcome under all possible scenarios:
code to recreate the toy dataset:
structure(list(Date = c(20240430L, 20240531L, 20240630L, 20240229L,
20240331L, 20240531L, 20240630L, 20240731L, 20230930L, 20231031L,
20231231L, 20230930L, 20231031L, 20231231L, 20240531L, 20240731L
), Name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Movement = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("EXISTING", "NEW"
), class = "factor"), Vintage = structure(c(5L, 5L, 5L, 5L,
5L, 3L, 3L, 3L, 1L, 1L, 1L, 5L, 5L, 2L, 2L, 4L), .Label = c("20230930",
"20231231", "20240531", "20240731", "Back Book"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))
I am looking for a code which can create the 'Vintage' column in an efficient manner, as the dataset has over 100k entries and keeps growing. Thanks!
I have tried creating two copies of the dataset (df_A, df_B), and looking up the Vintage for each row from df_A into df_B, using rowwise(), mutate() and a custom 'assignVintage' function called from within mutate(), however the code only performs subset of the required temporal rules and despite that is still very inefficient, running for over 20minutes - hence why I am looking for a more elegant solution (if one exist).
Here is a solution.
Movement
is "NEW"
;"Back Book"
, if it is and it's the group's first element, assign the date, else assign NA
;NA
's with the previous value.library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 %>%
mutate(new = cumsum(Movement == "NEW"), .by = Name) %>%
mutate(
Vintage2 = case_when(
new == 0 ~ "Back Book",
new > 0 & Movement == "NEW" ~ as.character(Date),
TRUE ~ NA_character_
),
.by = new
) %>%
tidyr::fill(Vintage2) %>%
select(-new)
#> Date Name Movement Vintage Vintage2
#> 1 20240430 A EXISTING Back Book Back Book
#> 2 20240531 A EXISTING Back Book Back Book
#> 3 20240630 A EXISTING Back Book Back Book
#> 4 20240229 B EXISTING Back Book Back Book
#> 5 20240331 B EXISTING Back Book Back Book
#> 6 20240531 B NEW 20240531 20240531
#> 7 20240630 B EXISTING 20240531 20240531
#> 8 20240731 B EXISTING 20240531 20240531
#> 9 20230930 C NEW 20230930 20230930
#> 10 20231031 C EXISTING 20230930 20230930
#> 11 20231231 C EXISTING 20230930 20230930
#> 12 20230930 D EXISTING Back Book Back Book
#> 13 20231031 D EXISTING Back Book Back Book
#> 14 20231231 D NEW 20231231 20231231
#> 15 20240531 D EXISTING 20231231 20231231
#> 16 20240731 D NEW 20240731 20240731
Created on 2024-08-21 with reprex v2.1.0