rdataframedplyrmutate

efficient R code to assign timestamp (vintage) to each row in a dataframe, based on related rows in the same dataframe


I have a dataset of customer transactions, and need to create a new column (Vintage), based on a fairly complicated set of 'temporal' rules:

This might sound quite convoluted, so I have included the image below to illustrate the desired outcome under all possible scenarios:

enter image description here

code to recreate the toy dataset:

structure(list(Date = c(20240430L, 20240531L, 20240630L, 20240229L, 
20240331L, 20240531L, 20240630L, 20240731L, 20230930L, 20231031L, 
20231231L, 20230930L, 20231031L, 20231231L, 20240531L, 20240731L
), Name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    Movement = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
    1L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("EXISTING", "NEW"
    ), class = "factor"), Vintage = structure(c(5L, 5L, 5L, 5L, 
    5L, 3L, 3L, 3L, 1L, 1L, 1L, 5L, 5L, 2L, 2L, 4L), .Label = c("20230930", 
    "20231231", "20240531", "20240731", "Back Book"), class = "factor")), class = "data.frame", row.names = c(NA, 
-16L))

I am looking for a code which can create the 'Vintage' column in an efficient manner, as the dataset has over 100k entries and keeps growing. Thanks!

I have tried creating two copies of the dataset (df_A, df_B), and looking up the Vintage for each row from df_A into df_B, using rowwise(), mutate() and a custom 'assignVintage' function called from within mutate(), however the code only performs subset of the required temporal rules and despite that is still very inefficient, running for over 20minutes - hence why I am looking for a more elegant solution (if one exist).


Solution

  • Here is a solution.

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    
    df1 %>%
      mutate(new = cumsum(Movement == "NEW"), .by = Name) %>%
      mutate(
        Vintage2 = case_when(
          new == 0 ~ "Back Book", 
          new > 0 & Movement == "NEW" ~ as.character(Date),
          TRUE ~ NA_character_
        ), 
        .by = new
      ) %>%
      tidyr::fill(Vintage2) %>%
      select(-new)
    #>        Date Name Movement   Vintage  Vintage2
    #> 1  20240430    A EXISTING Back Book Back Book
    #> 2  20240531    A EXISTING Back Book Back Book
    #> 3  20240630    A EXISTING Back Book Back Book
    #> 4  20240229    B EXISTING Back Book Back Book
    #> 5  20240331    B EXISTING Back Book Back Book
    #> 6  20240531    B      NEW  20240531  20240531
    #> 7  20240630    B EXISTING  20240531  20240531
    #> 8  20240731    B EXISTING  20240531  20240531
    #> 9  20230930    C      NEW  20230930  20230930
    #> 10 20231031    C EXISTING  20230930  20230930
    #> 11 20231231    C EXISTING  20230930  20230930
    #> 12 20230930    D EXISTING Back Book Back Book
    #> 13 20231031    D EXISTING Back Book Back Book
    #> 14 20231231    D      NEW  20231231  20231231
    #> 15 20240531    D EXISTING  20231231  20231231
    #> 16 20240731    D      NEW  20240731  20240731
    

    Created on 2024-08-21 with reprex v2.1.0