rdataframedplyrcasedata-transform

Create categorical variables from continuous variables


This is my dataset with three columns.

  ID     Flag    Z
  121    1       54.027
  122    1       41.7267
  121    2       26.006
  645    1       86.612
  217    1       95.831
  217    2        6.616

I like to create a fourth column which is a categorical variable with three levels based on the values in Z and Flag.

If the Flag = 2 then NewCol = "Flag2"
If the Flag = 1 and value of Z is above or equal to the median of Z then NewCol = "GT EQ Median"
If the Flag = 1 and value of Z is below the median of Z then NewCol = "LT Median"

The median of Z is estimated only for Z values where Flag=1, in this case median(c(54.027, 41.7267, 86.612, 95.831))~ 70.32. So if Flag = 1 and Z >= 54.027 then NewCol = "GT EQ Median", if Flag = 1 and Z < 54.027 then NewCol = "LT Median"

  ID     Flag    Z        NewCol
  121    1       54.027   LT Median
  122    1       41.7267  LT Median
  121    2       26.006   Flag2
  645    1       86.612   GT EQ Median
  217    1       95.831   GT EQ Median
  217    2        6.616   Flag2

I tried this with case_when function but I was not sure how to include the logic to check for median of Z when Flag=1. Any suggestion here is much appreciated. Thanks.

---------- question updated based on Benson and Darren's suggestion.


Solution

  • With the dplyr package, you can specify per group calculation using group_by(), and specify multiple condition with and &.

    Also, the input for median() should be a vector of numeric values, therefore you should wrap your numbers with c(). Otherwise, it would return the first number as the median, which is clearly incorrect.

    median(c(54.027, 41.7, 86.6, 95.8))
    [1] 70.3135
    
    library(dplyr)
    
    df %>% 
      group_by(Flag) %>% 
      mutate(newCol = case_when(Flag == 2 ~ "Flag2",
                                Flag == 1 & Z >= median(Z) ~ "GT EQ Median",
                                Flag == 1 & Z < median(Z) ~ "LT Median")) %>% 
      ungroup()
    
    # A tibble: 6 × 4
         ID  Flag     Z newCol      
      <int> <int> <dbl> <chr>       
    1   121     1 54.0  LT Median   
    2   122     1 41.7  LT Median   
    3   121     2 26.0  Flag2       
    4   645     1 86.6  GT EQ Median
    5   217     1 95.8  GT EQ Median
    6   217     2  6.62 Flag2