rfunctiondata.tablerowwise

R data table: how can I efficiently pass data row-wise to a function, using a vector of column names?


I've got some data.tables with patient data from a number of exams over the years, as well as a column with the year in which some event happened (let's say a heart attack). Here's a toy example:

dt <- data.table(patient_ID = c("A","B","C","D"),
                 heart_attack_year = c(2010, 2011, 2016, 2011),
                 exam_1_year = c(2005, 2001, 2002, 2002),
                 exam_2_year = c(2010, 2006, 2010, 2007),
                 exam_3_year = c(2015, 2012, 2011, 2012),
                 exam_4_year = c(2020, 2017, 2017, 2014))

I want to know which exam is closest in time to the heart attack for each patient. I've written a function that takes the target year and a list of the years of the different exams (v) as arguments, and returns a label for the closest exam:

which_exam_is_closest <- function(target_year, exam_labels, v){
  return(exam_labels[which.min(abs(target_year - unlist(v)))])
}

For different data sets, there can be different numbers of exams. The column names for the exam years also differ among data sets. I'd therefore like to be able to refer to the column names with a vector of strings that can be different for each data set.

A for loop can get the list v and call the function for each row of each data set. For the toy data set above, for example, I could write:

exams <- 1:4

closest_exam <- character(0)
for (i in 1:nrow(dt)){
  closest_exam <- c(closest_exam, which_exam_is_closest(dt$heart_attack_year[i], 
                                                        exams, 
                                                        as.list(dt[i,paste0("exam_",exams,"_year"), with = F])))
}
dt$closest <- closest_exam

This gets the desired result:

patient_ID     heart_attack_year exam_1_year exam_2_year exam_3_year exam_4_year closest
A              2010              2005        2010        2015        2020        2
B              2011              2001        2006        2012        2017        3
C              2016              2002        2010        2011        2017        4
D              2011              2002        2007        2012        2014        3

However, using the for loop seems really ugly and inefficient. Is there a better way to pass the list of relevant data for each patient to the function, using the vector of column names, and getting back the column "closest"?


Solution

  • An opportunity for max.col():

    dt[, 
       closest := max.col(sapply(.SD, \(x) -abs(x - heart_attack_year))), 
       .SDcols = patterns("exam_[0-9]_year")]
    #    patient_ID heart_attack_year exam_1_year exam_2_year exam_3_year exam_4_year closest
    #        <char>             <num>       <num>       <num>       <num>       <num>   <int>
    # 1:          A              2010        2005        2010        2015        2020       2
    # 2:          B              2011        2001        2006        2012        2017       3
    # 3:          C              2016        2002        2010        2011        2017       4
    # 4:          D              2011        2002        2007        2012        2014       3
    

    Perhaps the data feels better in two tables? This would allow the useof data.table joins:

    ATTACKS <- dt[, .(patient_ID, heart_attack_year)]
    EXAMS   <- dt[, .SD, .SDcols = patterns("ID|exam_[0-9]_year")
                  ][, melt(.SD, id.vars = "patient_ID")]
    ATTACKS[,  closest_exam := EXAMS[.SD, 
                                     on = .(patient_ID, value = heart_attack_year), 
                                     variable, 
                                     roll = "nearest"]]
    #    patient_ID heart_attack_year closest_exam
    #        <char>             <num>       <fctr>
    # 1:          A              2010  exam_2_year
    # 2:          B              2011  exam_3_year
    # 3:          C              2016  exam_4_year
    # 4:          D              2011  exam_3_year
    

    List columns can also be useful:

    vars <- sprintf("exam_%d_year", 1:4)
    dt[, year_exams := list(.(unlist(.SD))), .SDcols = vars, by = patient_ID]
    dt[, (vars) := NULL]
    dt[, closest := which.min(abs(heart_attack_year - unlist(year_exams)) ), by = patient_ID]
    
    #   patient_ID heart_attack_year          year_exams closest
    #        <char>             <num>              <list>   <int>
    # 1:          A              2010 2005,2010,2015,2020       2
    # 2:          B              2011 2001,2006,2012,2017       3
    # 3:          C              2016 2002,2010,2011,2017       4
    # 4:          D              2011 2002,2007,2012,2014       3