rdplyrtidyr

Expand existing table into a "versioned up-to-date statues" table


I want to generate a "tracked history" or "growing history" table, based on a source table. Consider this example:

library(tibble)

df_source <- 
  tibble::tribble(
  ~user_id, ~iteration,     ~animal, ~foo_val, ~bar_val,
     "xyz",         1L,     "zebra",      12L,       2L,
     "xyz",         1L,    "monkey",       3L,       4L,
     "xyz",         1L,  "elephant",       4L,      13L,
     "xyz",         2L,     "horse",      11L,       4L,
     "xyz",         2L,       "dog",       5L,      12L,
     "xyz",         2L,       "cow",       2L,      12L,
     "tty",         1L,   "giraffe",       8L,       4L,
     "tty",         1L,  "kangaroo",       6L,       1L,
     "tty",         1L,   "penguin",      12L,       5L,
     "tty",         2L, "crocodile",      10L,       7L,
     "tty",         2L,       "owl",       7L,       8L,
     "tty",         2L,  "kangaroo",      14L,       2L
  )

What I want is to "expand" this table such that it will look like a history table, that keeps track of the status of each user, as it was relevant for each iteration. This means that in iteration 1, we knew little - only about iteration 1. But in iteration 2, we knew about iteration 1 and 2, so the data is richer in iteration 2.

In fact, if there is a conflict between iteration 1's data and iteration 2's data (say the same animal but different values), the for iteration 2, the conflict is resolved by considering iteration 2's version over 1's.

So in this case, the result table should be:

user_id status_as_of_iteration animal foo_status bar_status my_comment
xyz 1 zebra 12 2 original iteration 1 remains
xyz 1 monkey 3 4 original iteration 1 remains
xyz 1 elephant 4 13 original iteration 1 remains
tty 1 giraffe 8 4 original iteration 1 remains
tty 1 kangaroo 6 1 original iteration 1 remains
tty 1 penguin 12 5 original iteration 1 remains
xyz 2 zebra 12 2 preserved history from iteration 1, carried over to iteration 2, nothing to override
xyz 2 monkey 3 4 preserved history from iteration 1, carried over to iteration 2, nothing to override
xyz 2 elephant 4 13 preserved history from iteration 1, carried over to iteration 2, nothing to override
xyz 2 horse 11 4 original iteration 2 remains
xyz 2 dog 5 12 original iteration 2 remains
xyz 2 cow 2 12 original iteration 2 remains
tty 1 giraffe 8 4 preserved history from iteration 1, carried over to iteration 2, nothing to override
tty 1 penguin 12 5 preserved history from iteration 1, carried over to iteration 2, nothing to override
tty 2 crocodile 10 7 original iteration 2
tty 2 owl 7 8 original iteration 2
tty 2 kangaroo 14 2 NOTE! original iteration 2, values of iteration 2 overrode iteration 1's vals

so it actually means to:

  1. replicate the previous iteration records per user, but increment their number for status_as_of_iteration
  2. keep the most recent iteration records, per user, as is.

but please note how user_id == tty in status_as_of_iteration == 2 has only a single record for animal == kangaroo, and this is because we preserve the most recent/up-to-date "status" for each animal in each iteration.


Does anyone know of a "simple" way to do it with dplyr or tidyr?


Solution

  • Using some other tidy package you can try this

    library(dplyr)
    library(tidyr)
    library(purrr)
    
    df_source %>% 
      nest(data=c(animal, foo_val, bar_val)) %>% 
      mutate(data = accumulate(data, rows_upsert, by="animal"), .by=user_id) %>% 
      unnest(cols=data)
    

    which returns

       user_id iteration animal    foo_val bar_val
       <chr>       <int> <chr>       <int>   <int>
     1 xyz             1 zebra          12       2
     2 xyz             1 monkey          3       4
     3 xyz             1 elephant        4      13
     4 xyz             2 zebra          12       2
     5 xyz             2 monkey          3       4
     6 xyz             2 elephant        4      13
     7 xyz             2 horse          11       4
     8 xyz             2 dog             5      12
     9 xyz             2 cow             2      12
    10 tty             1 giraffe         8       4
    11 tty             1 kangaroo        6       1
    12 tty             1 penguin        12       5
    13 tty             2 giraffe         8       4
    14 tty             2 kangaroo       14       2
    15 tty             2 penguin        12       5
    16 tty             2 crocodile      10       7
    17 tty             2 owl             7       8
    

    We nest the data, and then iteratively rows_upsert to update the the values in each group.