I am trying to wrangle some data for a cox regression...
#generate some data
set.seed(1)
ID <- sort(rep(1:10, times = 5))
conditions <- rep(c("asthma", "copd", "af", "cvd", "ckd"), times = 10)
day <- sample(1:100, 50)
#assign to dataframe
df <- data.frame(ID, conditions, day)
I have data in a long format like this:
ID | conditions | day |
---|---|---|
1 | asthma | 68 |
1 | copd | 39 |
1 | af | 1 |
1 | cvd | 34 |
1 | ckd | 87 |
2 | asthma | 43 |
2 | copd | 14 |
2 | af | 82 |
2 | cvd | 59 |
2 | ckd | 51 |
And I need it wrangled to this:
As you can see, ID=1 develops AF on day 1, cvd on day 34 and copd on day 39…
So assuming that this is in order of date…
In rownum 1, the af column changes to 1…
In rownum 2, the af AND cvd changes to 1…
In rownum3, the af AND cvd AND copd changes to 1…
Then it would be the same kind of pattern for all the other IDs.
rownum | ID | day | asthma | copd | af | cvd |
---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 0 | 1 | 0 |
2 | 1 | 34 | 0 | 0 | 1 | 1 |
3 | 1 | 39 | 0 | 1 | 1 | 1 |
4 | 1 | 68 | 1 | 1 | 1 | 1 |
5 | 2 | 14 | 0 | 1 | 0 | 0 |
6 | 2 | 43 | 1 | 1 | 0 | 0 |
7 | 2 | 51 | 1 | 1 | 0 | 1 |
… | … | … | … | … | … | … |
I've tried using a lag function, but it just doesn't work... the lag needs to work for multiple columns as you can see above.
dt[,temp:=ifelse(is.na(reglag(event_dt,1)), as.integer(0), reglag(event_dt,1)), by=ID]
dt[, sequence:=cumsum(temp)+1, by=ID]
func = function(x)
{
which(c(1,lag(x,1)[-1]) %in% 1) %>%
c(length(x)+1) %>%
diff
}
reglag = function(x,lag) {c(rep(NA,lag), x[lag:(length(x)-1)])}
dt[, cond.time:=func(event_dt) %>% lapply(seq) %>% unlist, by=ID]
Would be very grateful for any help you could give. I also have a massive table, so maybe a loop would cause me memory issues...
Many many thanks in advance ~R
Arrange
by ID and day, and pivot_wider
. You'll get 1
for the disease at day d, 0
elsewhere. Use cumsum
to add 1 to the consecutive values for each column.
library(dplyr)
library(tidyr)
df %>%
arrange(ID, day) %>%
mutate(value = 1) %>%
pivot_wider(names_from = conditions, values_fill = 0) %>%
group_by(ID) %>%
mutate(across(af:ckd, cumsum))
output
ID day af cvd copd asthma ckd
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0 0 0
2 1 34 1 1 0 0 0
3 1 39 1 1 1 0 0
4 1 68 1 1 1 1 0
5 1 87 1 1 1 1 1
6 2 14 0 0 1 0 0
7 2 43 0 0 1 1 0
8 2 51 0 0 1 1 1
9 2 59 0 1 1 1 1
10 2 82 1 1 1 1 1
# … with 40 more rows