Let's say I work in a psychological context and I'm wondering how many risk factors a patient has. After that, I would like to list all the risks and then discover the most prevalent risk (mode). I'm thinking on use mutate
and then paste0
and get the colname
if the value of the row is "risk". However, I'm having a hard time with that.
any help is appreaciated.
Code is below:
library(tidyverse)
df = data.frame(
patient = seq(1:60),
cancer = c("risk","ok"),
blood_pres = c("risk", "ok"),
low_education = c("risk","ok")
)
df = df %>% mutate(how_many_risks =
rowSums(. == "risk"))
Let's come up with some more interesting data.
set.seed(43)
df <- data.frame(patient = 1:10, cancer = sample(c("risk","ok"), size=10, replace=TRUE), blood_pres = sample(c("risk","ok"), size=10, replace=TRUE), low_education = sample(c("risk","ok"), size=10, replace=TRUE))
df
# patient cancer blood_pres low_education
# 1 1 ok risk risk
# 2 2 ok risk risk
# 3 3 ok ok ok
# 4 4 risk risk risk
# 5 5 ok ok risk
# 6 6 risk risk ok
# 7 7 ok ok ok
# 8 8 ok risk ok
# 9 9 ok ok ok
# 10 10 risk risk risk
From here, we'll pivot, summarize, then join back onto the original data.
library(dplyr)
library(tidyr) # pivot_*
df %>%
pivot_longer(cols = -patient, values_to = "risk") %>%
filter(risk == "risk") %>%
summarize(howmany = n(), risks = toString(name), .by = patient) %>%
left_join(df, ., by = "patient") %>%
mutate(howmany = coalesce(howmany, 0))
# patient cancer blood_pres low_education howmany risks
# 1 1 ok risk risk 2 blood_pres, low_education
# 2 2 ok risk risk 2 blood_pres, low_education
# 3 3 ok ok ok 0 <NA>
# 4 4 risk risk risk 3 cancer, blood_pres, low_education
# 5 5 ok ok risk 1 low_education
# 6 6 risk risk ok 2 cancer, blood_pres
# 7 7 ok ok ok 0 <NA>
# 8 8 ok risk ok 1 blood_pres
# 9 9 ok ok ok 0 <NA>
# 10 10 risk risk risk 3 cancer, blood_pres, low_education
(Note that dplyr_1.1.0
or newer is required to use .by=
. If you have an older dplyr and will not update, shift to using group_by(patient)
instead of .by=patient
.)
Something you may want to consider: unless this is solely for presentation tables, it is occasionally advantageous to have risks
as a list-column instead of a comma-delimited string. To do this, just replace toString
with list
, and while it may render the same on the console, it will allow things like set-ops on it (though normal column/vector operations may not work as you expect):
out <- df %>%
pivot_longer(cols = -patient, values_to = "risk") %>%
filter(risk == "risk") %>%
summarize(howmany = n(), risks = list(name), .by = patient) %>%
left_join(df, ., by = "patient") %>%
mutate(howmany = coalesce(howmany, 0))
out
# patient cancer blood_pres low_education howmany risks
# 1 1 ok risk risk 2 blood_pres, low_education
# 2 2 ok risk risk 2 blood_pres, low_education
# 3 3 ok ok ok 0 NULL
# 4 4 risk risk risk 3 cancer, blood_pres, low_education
# 5 5 ok ok risk 1 low_education
# 6 6 risk risk ok 2 cancer, blood_pres
# 7 7 ok ok ok 0 NULL
# 8 8 ok risk ok 1 blood_pres
# 9 9 ok ok ok 0 NULL
# 10 10 risk risk risk 3 cancer, blood_pres, low_education
If this data were a tibble (tbl_df
) instead, the same data would present as
tibble(out)
# # A tibble: 10 × 6
# patient cancer blood_pres low_education howmany risks
# <int> <chr> <chr> <chr> <dbl> <list>
# 1 1 ok risk risk 2 <chr [2]>
# 2 2 ok risk risk 2 <chr [2]>
# 3 3 ok ok ok 0 <NULL>
# 4 4 risk risk risk 3 <chr [3]>
# 5 5 ok ok risk 1 <chr [1]>
# 6 6 risk risk ok 2 <chr [2]>
# 7 7 ok ok ok 0 <NULL>
# 8 8 ok risk ok 1 <chr [1]>
# 9 9 ok ok ok 0 <NULL>
# 10 10 risk risk risk 3 <chr [3]>
We can do things directly such as check the lengths of each row in that column; or check quickly for exact set-membership:
lengths(out$risks)
# [1] 2 2 0 3 1 2 0 1 0 3
sapply(out$risks, `%in%`, x = "cancer")
# [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
Granted, both of those can be done with regex, but .. if the names have any ambiguity, regex carries a little overhead.