Assuming I have a dataset like this
id time cd4 sequence
1 -0.741958 548 1
1 -0.246407 893 2
1 0.243669 657 3
2 -2.7296369 464 1
2 -2.2505131 845 2
2 -0.221766 752 3
2 0.221766 459 4
2 0.77481198 181 5
2 1.256673 434 6
I like to compute the difference in cd4 values , the earliest value with every other value , within that ID.
ID 1, has 3 observations, earliest is at time -0.741, The code should estimate the difference of cd4 values ,
sequence 1 vs sequence 2 : 893 - 548
sequence 1 vs sequence 3 : 657 - 548
sequence 2 vs sequence 4 : 657 - 893
Id 2 has 6 observations, this is 6Choose2, which is 15 different combinations
So expecting a dataset like this
id time cd4 seq comb cd4_lag
1 -0.74195 548 1 1,1 0
1 -0.24640 893 2 1,2 345
1 0.243669 657 3 1,3 109
1 2,3 -236
2 -2.72963 464 1 1,1
2 -2.25051 845 2 1,2 381
2 -0.22176 752 3 1,3 288
2 0.221766 459 4 1,4 -5
2 0.774811 181 5 1,5 -283
2 1.256673 434 6 1,6 -30
2,3 -93
2,4 -386
2,5 -664
2,6 -411
3,4 -293
3,5 -571
3,6 -318
4,5 -278
4,6 -25
5,6 253
Thanks for any suggestions , advise.
Maybe something like this? I have left out the 1,1 self matches since it wasn't clear to me why those should be included but no other self matches.
library(dplyr)
df |>
inner_join(df |> select(-time),
join_by(id, sequence < sequence)) |>
mutate(cd4_lag = cd4.y - cd4.x)
Result
id time cd4.x sequence.x cd4.y sequence.y cd4_lag
1 1 -0.741958 548 1 893 2 345
2 1 -0.741958 548 1 657 3 109
3 1 -0.246407 893 2 657 3 -236
4 2 -2.729637 464 1 845 2 381
5 2 -2.729637 464 1 752 3 288
6 2 -2.729637 464 1 459 4 -5
7 2 -2.729637 464 1 181 5 -283
8 2 -2.729637 464 1 434 6 -30
9 2 -2.250513 845 2 752 3 -93
10 2 -2.250513 845 2 459 4 -386
11 2 -2.250513 845 2 181 5 -664
12 2 -2.250513 845 2 434 6 -411
13 2 -0.221766 752 3 459 4 -293
14 2 -0.221766 752 3 181 5 -571
15 2 -0.221766 752 3 434 6 -318
16 2 0.221766 459 4 181 5 -278
17 2 0.221766 459 4 434 6 -25
18 2 0.774812 181 5 434 6 253