I've got some data.tables with patient data from a number of exams over the years, as well as a column with the year in which some event happened (let's say a heart attack). Here's a toy example:
dt <- data.table(patient_ID = c("A","B","C","D"),
heart_attack_year = c(2010, 2011, 2016, 2011),
exam_1_year = c(2005, 2001, 2002, 2002),
exam_2_year = c(2010, 2006, 2010, 2007),
exam_3_year = c(2015, 2012, 2011, 2012),
exam_4_year = c(2020, 2017, 2017, 2014))
I want to know which exam is closest in time to the heart attack for each patient. I've written a function that takes the target year and a list of the years of the different exams (v) as arguments, and returns a label for the closest exam:
which_exam_is_closest <- function(target_year, exam_labels, v){
return(exam_labels[which.min(abs(target_year - unlist(v)))])
}
For different data sets, there can be different numbers of exams. The column names for the exam years also differ among data sets. I'd therefore like to be able to refer to the column names with a vector of strings that can be different for each data set.
A for loop can get the list v and call the function for each row of each data set. For the toy data set above, for example, I could write:
exams <- 1:4
closest_exam <- character(0)
for (i in 1:nrow(dt)){
closest_exam <- c(closest_exam, which_exam_is_closest(dt$heart_attack_year[i],
exams,
as.list(dt[i,paste0("exam_",exams,"_year"), with = F])))
}
dt$closest <- closest_exam
This gets the desired result:
patient_ID heart_attack_year exam_1_year exam_2_year exam_3_year exam_4_year closest
A 2010 2005 2010 2015 2020 2
B 2011 2001 2006 2012 2017 3
C 2016 2002 2010 2011 2017 4
D 2011 2002 2007 2012 2014 3
However, using the for loop seems really ugly and inefficient. Is there a better way to pass the list of relevant data for each patient to the function, using the vector of column names, and getting back the column "closest"?
An opportunity for max.col()
:
dt[,
closest := max.col(sapply(.SD, \(x) -abs(x - heart_attack_year))),
.SDcols = patterns("exam_[0-9]_year")]
# patient_ID heart_attack_year exam_1_year exam_2_year exam_3_year exam_4_year closest
# <char> <num> <num> <num> <num> <num> <int>
# 1: A 2010 2005 2010 2015 2020 2
# 2: B 2011 2001 2006 2012 2017 3
# 3: C 2016 2002 2010 2011 2017 4
# 4: D 2011 2002 2007 2012 2014 3
Perhaps the data feels better in two tables? This would allow the useof data.table joins:
ATTACKS <- dt[, .(patient_ID, heart_attack_year)]
EXAMS <- dt[, .SD, .SDcols = patterns("ID|exam_[0-9]_year")
][, melt(.SD, id.vars = "patient_ID")]
ATTACKS[, closest_exam := EXAMS[.SD,
on = .(patient_ID, value = heart_attack_year),
variable,
roll = "nearest"]]
# patient_ID heart_attack_year closest_exam
# <char> <num> <fctr>
# 1: A 2010 exam_2_year
# 2: B 2011 exam_3_year
# 3: C 2016 exam_4_year
# 4: D 2011 exam_3_year
List columns can also be useful:
vars <- sprintf("exam_%d_year", 1:4)
dt[, year_exams := list(.(unlist(.SD))), .SDcols = vars, by = patient_ID]
dt[, (vars) := NULL]
dt[, closest := which.min(abs(heart_attack_year - unlist(year_exams)) ), by = patient_ID]
# patient_ID heart_attack_year year_exams closest
# <char> <num> <list> <int>
# 1: A 2010 2005,2010,2015,2020 2
# 2: B 2011 2001,2006,2012,2017 3
# 3: C 2016 2002,2010,2011,2017 4
# 4: D 2011 2002,2007,2012,2014 3