This is my dataset with three columns.
ID Flag Z
121 1 54.027
122 1 41.7267
121 2 26.006
645 1 86.612
217 1 95.831
217 2 6.616
I like to create a fourth column which is a categorical variable with three levels based on the values in Z and Flag.
If the Flag = 2 then NewCol = "Flag2"
If the Flag = 1 and value of Z is above or equal to the median of Z then NewCol = "GT EQ Median"
If the Flag = 1 and value of Z is below the median of Z then NewCol = "LT Median"
The median of Z is estimated only for Z values where Flag=1, in this case median(c(54.027, 41.7267, 86.612, 95.831))
~ 70.32. So if Flag = 1 and Z >= 54.027 then NewCol = "GT EQ Median", if Flag = 1 and Z < 54.027 then NewCol = "LT Median"
ID Flag Z NewCol
121 1 54.027 LT Median
122 1 41.7267 LT Median
121 2 26.006 Flag2
645 1 86.612 GT EQ Median
217 1 95.831 GT EQ Median
217 2 6.616 Flag2
I tried this with case_when
function but I was not sure how to include the logic to check for median of Z when Flag=1. Any suggestion here is much appreciated. Thanks.
---------- question updated based on Benson and Darren's suggestion.
With the dplyr
package, you can specify per group calculation using group_by()
, and specify multiple condition with and &
.
Also, the input for median()
should be a vector of numeric values, therefore you should wrap your numbers with c()
. Otherwise, it would return the first number as the median, which is clearly incorrect.
median(c(54.027, 41.7, 86.6, 95.8))
[1] 70.3135
library(dplyr)
df %>%
group_by(Flag) %>%
mutate(newCol = case_when(Flag == 2 ~ "Flag2",
Flag == 1 & Z >= median(Z) ~ "GT EQ Median",
Flag == 1 & Z < median(Z) ~ "LT Median")) %>%
ungroup()
# A tibble: 6 × 4
ID Flag Z newCol
<int> <int> <dbl> <chr>
1 121 1 54.0 LT Median
2 122 1 41.7 LT Median
3 121 2 26.0 Flag2
4 645 1 86.6 GT EQ Median
5 217 1 95.8 GT EQ Median
6 217 2 6.62 Flag2