I have a long df
that looks like this:
df <- data.frame(id=as.integer(c(123,123,123,124,124,124,125,125,126,126,126)),
date=as.Date(c("2014-03-12", "2015-05-02", "2015-09-16", "2015-10-24", "2016-12-11", "2017-10-17", "2017-08-06", "2018-01-29", "2015-09-16", "2015-11-12", "2015-12-03")),
event=as.character(c("A", "C", "E", "A", "B", "D", "A", "E", "A", "B", "E")),
order=as.integer(c(1,2,3,1,2,3,1,2,1,2,3)),
diff=as.integer(c(0,416,553,0,414,724,0,176,0,57,78)))
df
id date event order diff
1 123 2014-03-12 A 1 0
2 123 2015-05-02 C 2 416
3 123 2015-09-16 E 3 553
4 124 2015-10-24 A 1 0
5 124 2016-12-11 B 2 414
6 124 2017-10-17 D 3 724
7 125 2017-08-06 A 1 0
8 125 2018-01-29 E 2 176
9 126 2015-09-16 A 1 0
10 126 2015-11-12 B 2 57
11 126 2015-12-03 E 3 78
Each id
will always have an initial event A
and a final event, either D
or E
(mutually exclusive). Events B
and C
might or might not occur. diff
is the difference in days
between the date
of each event and the date
of the initial event A
per id
.
I want to obtain a wide df
where each event will be a column
(e.g. A_status
) where 0= absent
; 1= present
. Similarly, each correspondent diff
will be a column
(e.g. A_time
). But when event B
or C
is absent (e.g B= 0
or C= 0
), I want their time
to be filled with either D_time
or E_time
, whichever is present.
I need to create two columns
based on the values
of D
and E
:
column
D.E_status
, where 0=D
; 1=E
, andcolumn
D.E_time
that will receive whichever time
is recorded (of D
or E
).This is the desired output:
id A_status A_time B_status B_time C_status C_time D.E_status D.E_time
123 1 0 0 553 1 416 1 553
124 1 0 1 414 0 724 0 724
125 1 0 0 176 0 176 1 176
126 1 0 1 57 0 78 1 78
I really appreciate your help on this given my very basic R skills.
I'll demonstrate a dplyr
/tidyr
solution. Up front, I think your last value for B_time
should be 57
not 78
.
library(dplyr)
library(tidyr) # pivot_wider
df %>%
mutate(
status = if_else(event == "D", 0, 1),
event = if_else(event %in% c("D", "E"), "D.E", event)
) %>%
pivot_wider(
id_cols = "id",
names_from = "event", values_from = c("diff", "status")
) %>%
rename_with(.fn = ~ sub("diff", "time", sub("(.*)_(.*)", "\\2_\\1", .))) %>%
mutate(
across(c(C_time, B_time), ~ coalesce(., D.E_time)),
across(c(C_status, B_status), ~ +(!is.na(.)))
) %>%
select(order(colnames(.))) %>%
relocate(id)
# # A tibble: 4 × 9
# id A_status A_time B_status B_time C_status C_time D.E_status D.E_time
# <int> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
# 1 123 1 0 0 553 1 416 1 553
# 2 124 1 0 1 414 0 724 0 724
# 3 125 1 0 0 176 0 176 1 176
# 4 126 1 0 1 57 0 78 1 78