TL;DR: summarize(across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE)))
requires a non NA list of vars_to_sum
, while I'm working inside a function and the user will sometimes ask for no variable to be summarized by a sum. The summarize()
includes other across()
and I'm not sure it is solvable by a simple if()
.
At least that what I think the problem is...
Hi!
I'm working on a project that involves quite a bit of data management/engineering, and there's one step that I struggle with. I'll replicate the problem on fake data.
I'm working on hospital stays. I started from relational databases, and I'm slowly summarizing the information in a format that fits my needs. Where my problem starts, the data will look something like this :
df <- data.frame(
patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
entry_date = as.POSIXt("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25"),
exit_date = as.POSIXt("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10"),
entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
drug_A = c(0,0,1,1,0,1,0,0),
drug_B = c(0,1,1,0,0,0,1,1),
drug_C = c(1,2,5,1,0,0,1,0)
)
N.B: the database will have plenty more drug_xxx
variables
My goal here is to develop a function that would
(1) identify patients that did several stays in a short period of time (started a hospital stay soon after the exit of another one, N.B: overlapping stays also exist), since we will consider that all such stays are related to the same medical issue (let's say cardiac arrest). If several hospitals/wards/doctors/... notify a cardiac arrest from the same person several days in a row, it's treated as 1 cardiac arrest, followed by several stays/consultations/... all related to the same cardiac arrest.
(2) summarizing the info of nearby stays by maxing the variables that should be maxed (defined by user), summing what should be summed, considering the earliest or latest datapoint when appropriate, and keeping the rest untouched
The function would consider the following arguments:
Ideally, the function should work even if var_to_xxx arguments are NA, implying there's no variable to be sumrized this way on this occasion.
[EDIT: I initially shared my functional function, mixing dplyr and for-loops, but I think it just made the post less readable, so here goes a function that almost work and that needs fixing]
aggregate_stays <- function (df, individual_id, date_entry, date_exit, allowed_dist=0, vars_to_earliest=NA, vars_to_latest=NA, vars_to_sum=NA, vars_to_max=NA){
require("lubridate",quietly=TRUE)
require("dplyr",quietly=TRUE)
# Distinguishing between single stays and multiple stays
df_single_stays <- df %>%
group_by(!!sym(individual_id)) %>%
filter(n() == 1) %>%
ungroup()
df_multiple_stays <- df %>%
group_by(!!sym(individual_id)) %>%
filter(n() > 1) %>%
ungroup()
# Summarizing multiple stays according to the arguments of the function
df_multiple_stays <- df_multiple_stays %>%
arrange(!!sym(individual_id), !!sym(date_entry)) %>% # Ordering the individuals and stays
group_by(!!sym(individual_id)) %>%
mutate(wholestay_id = cumsum(!!sym(date_entry) - lag(!!sym(date_exit), default = first(!!sym(date_exit))) > allowed_dist)) %>% # Checking if nearby stays
group_by(!!sym(individual_id), wholestay_id) %>%
mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
ungroup() %>%
group_by(wholestay_id) %>%
summarize(wholestay_entry = min(!!sym(date_entry)),
wholestay_exit = max(!!sym(date_exit)),
across(all_of(vars_to_earliest), ~ .[which.min(!!sym(date_entry))]),
across(all_of(vars_to_latest), ~ .[which.max(!!sym(date_exit))]),
across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE)),
across(all_of(vars_to_max), ~ max(., na.rm=TRUE))) %>%
ungroup()
# Removing the `wholestay_id` variable that was not asked by user
df_multiple_stays <- df_multiple_stays %>%
dplyr::select(-wholestay_id)
df_final <- bind_rows(df_multiple_stays,df_single_stays)
return(df_final)
}
But it doesn't seem to enjoy NA as argument for vars_to_xxx arguments
df2 <- aggregate_stays_wip(df = df,
individual_id = "patient_id",
date_entry = "entry_date",
date_exit = "exit_date",
allowed_dist = 10,
vars_to_sum = "drug_C",
vars_to_max = c("drug_A","drug_B"),
vars_to_earliest = "entry_mode",
vars_to_latest = c("patient_id","exit_mode"))
works fine
but
df3 <- aggregate_stays(df = df,
individual_id = "patient_id",
date_entry = "entry_date",
date_exit = "exit_date",
allowed_dist = 10,
vars_to_sum = NA,
vars_to_max = c("drug_A","drug_B","drug_C"),
vars_to_earliest = "entry_mode",
vars_to_latest = c("patient_id","exit_mode"))
returns
Error in `summarize()`:
ℹ In argument: `across(all_of(vars_to_sum), ~sum(., na.rm = TRUE))`.
Caused by error in `across()`:
! Selections can't have missing values.
Run `rlang::last_trace()` to see where the error occurred.
If you have any way to solve this, I'd love to hear it!
Best, an epidemiologist who whish he'd be better at coding.
Pass NULL
instead and it then skips over that line of the summarise call.
You can also perhaps make things easier by passing NULL
as default and using tidy evaluation to avoid having to do quotes:
library(tidyverse)
aggregate_things <- function(
data,
vars_to_sum = NULL,
vars_to_mean = NULL,
vars_to_max = NULL
) {
data |>
summarise(
across({{vars_to_sum}}, sum),
across({{vars_to_mean}}, mean),
across({{vars_to_max}}, max),
)
}
iris |>
aggregate_things(vars_to_sum = c(Sepal.Length, Sepal.Width),
vars_to_max = c(Petal.Length, Petal.Width))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 876.5 458.6 6.9 2.5
NULL
slibrary(tidyverse)
df <- data.frame(
patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
entry_date = ymd("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25"),
exit_date = ymd("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10"),
entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
drug_A = c("0","0","1","1","0","1","0","0"),
drug_B = c("0","1","1","0","0","0","1","1"),
drug_C = c("1","2","5","1","0","0","1","0")
) |>
mutate(across(starts_with("drug"), as.integer))
aggregate_stays <-
function (df,
individual_id,
date_entry,
date_exit,
allowed_dist = 0,
vars_to_earliest = NULL,
vars_to_latest = NULL,
vars_to_sum = NULL,
vars_to_max = NULL) {
require("lubridate",quietly=TRUE)
require("dplyr",quietly=TRUE)
# Distinguishing between single stays and multiple stays
df_single_stays <- df %>%
group_by({{individual_id}}) %>%
filter(n() == 1) %>%
ungroup()
df_multiple_stays <- df %>%
group_by({{individual_id}}) %>%
filter(n() > 1) %>%
ungroup()
# Summarizing multiple stays according to the arguments of the function
df_multiple_stays <- df_multiple_stays %>%
arrange({{individual_id}}, {{date_entry}}) %>% # Ordering the individuals and stays
group_by({{individual_id}}) %>%
mutate(wholestay_id = cumsum({{date_entry}} - lag({{date_exit}}, default = first({{date_exit}})) > allowed_dist)) %>% # Checking if nearby stays
group_by({{individual_id}}, wholestay_id) %>%
mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
ungroup() %>%
group_by(wholestay_id) %>%
summarize(wholestay_entry = min({{date_entry}}),
wholestay_exit = max({{date_exit}}),
across({{vars_to_earliest}}, ~ .[which.min({{date_entry}})]),
across({{vars_to_latest}}, ~ .[which.max({{date_exit}})]),
across({{vars_to_sum}}, ~ sum(., na.rm=TRUE)),
across({{vars_to_max}}, ~ max(., na.rm=TRUE))) %>%
ungroup()
# Removing the `wholestay_id` variable that was not asked by user
df_multiple_stays <- df_multiple_stays %>%
dplyr::select(-wholestay_id)
df_final <- bind_rows(df_multiple_stays,df_single_stays)
return(df_final)
}
aggregate_stays(df = df,
individual_id = patient_id,
date_entry = entry_date,
date_exit = exit_date,
allowed_dist = 10,
vars_to_sum = drug_C,
vars_to_max = c(drug_A,drug_B),
vars_to_earliest = entry_mode,
vars_to_latest = c(patient_id,exit_mode))
#> # A tibble: 5 × 10
#> wholestay_entry wholestay_exit entry_mode patient_id exit_mode drug_C drug_A
#> <date> <date> <chr> <chr> <chr> <int> <int>
#> 1 2020-02-01 2020-02-10 home Bryan home 7 1
#> 2 2020-03-01 2020-03-10 home Charlotte discharged 1 1
#> 3 2020-04-01 2020-04-10 home Charlotte home 0 0
#> 4 2020-05-01 2020-06-10 transfer Denis transfer 1 1
#> 5 NA NA home Anne death 1 0
#> # ℹ 3 more variables: drug_B <int>, entry_date <date>, exit_date <date>
aggregate_stays(df = df,
individual_id = patient_id,
date_entry = entry_date,
date_exit = exit_date,
allowed_dist = 10,
vars_to_max = c(drug_A,drug_B,drug_C),
vars_to_earliest = entry_mode,
vars_to_latest = c(patient_id,exit_mode))
#> # A tibble: 5 × 10
#> wholestay_entry wholestay_exit entry_mode patient_id exit_mode drug_A drug_B
#> <date> <date> <chr> <chr> <chr> <int> <int>
#> 1 2020-02-01 2020-02-10 home Bryan home 1 1
#> 2 2020-03-01 2020-03-10 home Charlotte discharged 1 0
#> 3 2020-04-01 2020-04-10 home Charlotte home 0 0
#> 4 2020-05-01 2020-06-10 transfer Denis transfer 1 1
#> 5 NA NA home Anne death 0 0
#> # ℹ 3 more variables: drug_C <int>, entry_date <date>, exit_date <date>
This should now take your data and give the intended result, with the added advantage of if you pipe your dataframe into the function in RStudio then tab autocomplete will look up the columns in your dataframe for you and pass them as unquoted column names.