I want to generate a "tracked history" or "growing history" table, based on a source table. Consider this example:
library(tibble)
df_source <-
tibble::tribble(
~user_id, ~iteration, ~animal, ~foo_val, ~bar_val,
"xyz", 1L, "zebra", 12L, 2L,
"xyz", 1L, "monkey", 3L, 4L,
"xyz", 1L, "elephant", 4L, 13L,
"xyz", 2L, "horse", 11L, 4L,
"xyz", 2L, "dog", 5L, 12L,
"xyz", 2L, "cow", 2L, 12L,
"tty", 1L, "giraffe", 8L, 4L,
"tty", 1L, "kangaroo", 6L, 1L,
"tty", 1L, "penguin", 12L, 5L,
"tty", 2L, "crocodile", 10L, 7L,
"tty", 2L, "owl", 7L, 8L,
"tty", 2L, "kangaroo", 14L, 2L
)
What I want is to "expand" this table such that it will look like a history table, that keeps track of the status of each user, as it was relevant for each iteration. This means that in iteration 1, we knew little - only about iteration 1. But in iteration 2, we knew about iteration 1 and 2, so the data is richer in iteration 2.
In fact, if there is a conflict between iteration 1's data and iteration 2's data (say the same animal
but different values), the for iteration 2, the conflict is resolved by considering iteration 2's version over 1's.
So in this case, the result table should be:
user_id | status_as_of_iteration | animal | foo_status | bar_status | my_comment |
---|---|---|---|---|---|
xyz | 1 | zebra | 12 | 2 | original iteration 1 remains |
xyz | 1 | monkey | 3 | 4 | original iteration 1 remains |
xyz | 1 | elephant | 4 | 13 | original iteration 1 remains |
tty | 1 | giraffe | 8 | 4 | original iteration 1 remains |
tty | 1 | kangaroo | 6 | 1 | original iteration 1 remains |
tty | 1 | penguin | 12 | 5 | original iteration 1 remains |
xyz | 2 | zebra | 12 | 2 | preserved history from iteration 1, carried over to iteration 2, nothing to override |
xyz | 2 | monkey | 3 | 4 | preserved history from iteration 1, carried over to iteration 2, nothing to override |
xyz | 2 | elephant | 4 | 13 | preserved history from iteration 1, carried over to iteration 2, nothing to override |
xyz | 2 | horse | 11 | 4 | original iteration 2 remains |
xyz | 2 | dog | 5 | 12 | original iteration 2 remains |
xyz | 2 | cow | 2 | 12 | original iteration 2 remains |
tty | 1 | giraffe | 8 | 4 | preserved history from iteration 1, carried over to iteration 2, nothing to override |
tty | 1 | penguin | 12 | 5 | preserved history from iteration 1, carried over to iteration 2, nothing to override |
tty | 2 | crocodile | 10 | 7 | original iteration 2 |
tty | 2 | owl | 7 | 8 | original iteration 2 |
tty | 2 | kangaroo | 14 | 2 | NOTE! original iteration 2, values of iteration 2 overrode iteration 1's vals |
so it actually means to:
status_as_of_iteration
but please note how user_id == tty
in status_as_of_iteration == 2
has only a single record for animal == kangaroo
, and this
is because we preserve the most recent/up-to-date "status" for each animal
in each iteration.
Does anyone know of a "simple" way to do it with dplyr or tidyr?
Using some other tidy package you can try this
library(dplyr)
library(tidyr)
library(purrr)
df_source %>%
nest(data=c(animal, foo_val, bar_val)) %>%
mutate(data = accumulate(data, rows_upsert, by="animal"), .by=user_id) %>%
unnest(cols=data)
which returns
user_id iteration animal foo_val bar_val
<chr> <int> <chr> <int> <int>
1 xyz 1 zebra 12 2
2 xyz 1 monkey 3 4
3 xyz 1 elephant 4 13
4 xyz 2 zebra 12 2
5 xyz 2 monkey 3 4
6 xyz 2 elephant 4 13
7 xyz 2 horse 11 4
8 xyz 2 dog 5 12
9 xyz 2 cow 2 12
10 tty 1 giraffe 8 4
11 tty 1 kangaroo 6 1
12 tty 1 penguin 12 5
13 tty 2 giraffe 8 4
14 tty 2 kangaroo 14 2
15 tty 2 penguin 12 5
16 tty 2 crocodile 10 7
17 tty 2 owl 7 8
We nest the data, and then iteratively rows_upsert
to update the the values in each group.