rdplyrtidyrrowwise

Using mutate to paste a column name if a condition is matched


Let's say I work in a psychological context and I'm wondering how many risk factors a patient has. After that, I would like to list all the risks and then discover the most prevalent risk (mode). I'm thinking on use mutate and then paste0 and get the colname if the value of the row is "risk". However, I'm having a hard time with that. any help is appreaciated.

risk

Code is below:

library(tidyverse)
df = data.frame(
  patient = seq(1:60),
  cancer = c("risk","ok"), 
  blood_pres = c("risk", "ok"),
  low_education = c("risk","ok")
) 

df = df %>% mutate(how_many_risks =
                     rowSums(. == "risk"))

Solution

  • Let's come up with some more interesting data.

    set.seed(43)
    df <- data.frame(patient = 1:10, cancer = sample(c("risk","ok"), size=10, replace=TRUE), blood_pres = sample(c("risk","ok"), size=10, replace=TRUE), low_education = sample(c("risk","ok"), size=10, replace=TRUE))
    df
    #    patient cancer blood_pres low_education
    # 1        1     ok       risk          risk
    # 2        2     ok       risk          risk
    # 3        3     ok         ok            ok
    # 4        4   risk       risk          risk
    # 5        5     ok         ok          risk
    # 6        6   risk       risk            ok
    # 7        7     ok         ok            ok
    # 8        8     ok       risk            ok
    # 9        9     ok         ok            ok
    # 10      10   risk       risk          risk
    

    From here, we'll pivot, summarize, then join back onto the original data.

    library(dplyr)
    library(tidyr) # pivot_*
    df %>%
      pivot_longer(cols = -patient, values_to = "risk") %>%
      filter(risk == "risk") %>%
      summarize(howmany = n(), risks = toString(name), .by = patient) %>%
      left_join(df, ., by = "patient") %>%
      mutate(howmany = coalesce(howmany, 0))
    #    patient cancer blood_pres low_education howmany                             risks
    # 1        1     ok       risk          risk       2         blood_pres, low_education
    # 2        2     ok       risk          risk       2         blood_pres, low_education
    # 3        3     ok         ok            ok       0                              <NA>
    # 4        4   risk       risk          risk       3 cancer, blood_pres, low_education
    # 5        5     ok         ok          risk       1                     low_education
    # 6        6   risk       risk            ok       2                cancer, blood_pres
    # 7        7     ok         ok            ok       0                              <NA>
    # 8        8     ok       risk            ok       1                        blood_pres
    # 9        9     ok         ok            ok       0                              <NA>
    # 10      10   risk       risk          risk       3 cancer, blood_pres, low_education
    

    (Note that dplyr_1.1.0 or newer is required to use .by=. If you have an older dplyr and will not update, shift to using group_by(patient) instead of .by=patient.)

    Something you may want to consider: unless this is solely for presentation tables, it is occasionally advantageous to have risks as a list-column instead of a comma-delimited string. To do this, just replace toString with list, and while it may render the same on the console, it will allow things like set-ops on it (though normal column/vector operations may not work as you expect):

    out <- df %>%
      pivot_longer(cols = -patient, values_to = "risk") %>%
      filter(risk == "risk") %>%
      summarize(howmany = n(), risks = list(name), .by = patient) %>%
      left_join(df, ., by = "patient") %>%
      mutate(howmany = coalesce(howmany, 0))
    out
    #    patient cancer blood_pres low_education howmany                             risks
    # 1        1     ok       risk          risk       2         blood_pres, low_education
    # 2        2     ok       risk          risk       2         blood_pres, low_education
    # 3        3     ok         ok            ok       0                              NULL
    # 4        4   risk       risk          risk       3 cancer, blood_pres, low_education
    # 5        5     ok         ok          risk       1                     low_education
    # 6        6   risk       risk            ok       2                cancer, blood_pres
    # 7        7     ok         ok            ok       0                              NULL
    # 8        8     ok       risk            ok       1                        blood_pres
    # 9        9     ok         ok            ok       0                              NULL
    # 10      10   risk       risk          risk       3 cancer, blood_pres, low_education
    

    If this data were a tibble (tbl_df) instead, the same data would present as

    tibble(out)
    # # A tibble: 10 × 6
    #    patient cancer blood_pres low_education howmany risks    
    #      <int> <chr>  <chr>      <chr>           <dbl> <list>   
    #  1       1 ok     risk       risk                2 <chr [2]>
    #  2       2 ok     risk       risk                2 <chr [2]>
    #  3       3 ok     ok         ok                  0 <NULL>   
    #  4       4 risk   risk       risk                3 <chr [3]>
    #  5       5 ok     ok         risk                1 <chr [1]>
    #  6       6 risk   risk       ok                  2 <chr [2]>
    #  7       7 ok     ok         ok                  0 <NULL>   
    #  8       8 ok     risk       ok                  1 <chr [1]>
    #  9       9 ok     ok         ok                  0 <NULL>   
    # 10      10 risk   risk       risk                3 <chr [3]>
    

    We can do things directly such as check the lengths of each row in that column; or check quickly for exact set-membership:

    lengths(out$risks)
    #  [1] 2 2 0 3 1 2 0 1 0 3
    
    sapply(out$risks, `%in%`, x = "cancer")
    #  [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
    

    Granted, both of those can be done with regex, but .. if the names have any ambiguity, regex carries a little overhead.