rjointimedata.tablenon-equi-join

Join by overlapping periods while operating for some of the values


I'm trying to join one database of periods like this one:

id = c(rep(1,3), rep(2,3), rep(3,3))
start = as.Date(c("2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13"))
end = as.Date(c("2015-03-11", "2015-08-12", "2018-12-31", "2015-03-11", "2015-08-12", "2018-12-31","2015-03-11", "2015-08-12", "2018-12-31"))

DT = data.table(id, start, end)

DT

   id      start        end
1:  1 2014-07-01 2015-03-11
2:  1 2015-03-12 2015-08-12
3:  1 2016-08-13 2018-12-31
4:  2 2014-07-01 2015-03-11
5:  2 2015-03-12 2015-08-12
6:  2 2016-08-13 2018-12-31
7:  3 2014-07-01 2015-03-11
8:  3 2015-03-12 2015-08-12
9:  3 2016-08-13 2018-12-31

with one that has clinical registries (weight and height) like this one:

id_clin = (c(rep(1,2), rep (2,3), rep(3,4)))
date = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01", "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01", "2018-11-30"))
weight = c(60, 65, 62, 75, 68, 90 , 102, 104 , 98 )
height = c(160, 160, 170, 175, 170, 200, 200, 200 ,200)

DT_clin = data.table(id_clin, date, weight, height)

DT_clin

   id_clin       date weight height
1:       1 2014-10-23     60    160
2:       1 2016-09-01     65    160
3:       2 2017-01-01     62    170
4:       2 2014-08-01     75    175
5:       2 2015-02-01     68    170
6:       3 2017-06-01     90    200
7:       3 2018-03-05    102    200
8:       3 2018-09-01    104    200
9:       3 2018-11-30     98    200

The desired outcome would look something like this*:

   id      start        end       date       date2       weight       height
1:  1 2014-07-01 2015-03-11 2014-10-23  2014-10-23         60.0        160.0
2:  1 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
3:  1 2016-08-13 2018-12-31 2016-09-01  2016-09-01         65.0        160.0
4:  2 2014-07-01 2015-03-11 2014-08-01  2015-02-01         71.5        172.5
5:  2 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
6:  2 2016-08-13 2018-12-31 2017-01-01  2017-01-01         62.0        170.0
7:  3 2014-07-01 2015-03-11       <NA>        <NA>           NA           NA
8:  3 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
9:  3 2016-08-13 2018-12-31 2018-03-05  2018-11-30        101.3        200.0

Also, if there is a way to make more than one operation for the different variables I will also be interested to know a way. (ex. calculate the mean of weight and the max value of height at the same time i make the join)

I've tested foverlaps with good results when there is only one value, but I'm unable to acomplish my objective when there is more than one value that overlap:

setkey(DT, id, start, end)
setkey(DT_clin, id_clin, date, date2)

foverlaps(DT[id == "1", ], DT_clin[id == "1",], by.x =c("id", "start", "end") , by.y = c("id_clin", "date", "date2" ), nomatch = NA )

Should I use Non-Equi-Joins?

*I duplicated date to create date2 and faked an interval of time


Solution

  • With foverlaps:

    library(data.table)
    setkey(DT_clin, id_clin, date, date2)
    
    foverlaps(DT, DT_clin, 
              by.x =c("id", "start", "end"), 
              by.y = c("id_clin", "date", "date2" ), nomatch = NA )[
              ,.(datemin = min(date),
                 datemax = max(date),
                 weight  = mean(weight,na.r=T),
                 height  = mean(height,na.rm=T)),
               by=.(id,start,end)]
    
       id      start        end    datemin    datemax weight height
    1:  1 2014-07-01 2015-03-11 2014-10-23 2014-10-23   60.0  160.0
    2:  1 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
    3:  1 2016-08-13 2018-12-31 2016-09-01 2016-09-01   65.0  160.0
    4:  2 2014-07-01 2015-03-11 2014-08-01 2015-02-01   71.5  172.5
    5:  2 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
    6:  2 2016-08-13 2018-12-31 2017-01-01 2017-01-01   62.0  170.0
    7:  3 2014-07-01 2015-03-11       <NA>       <NA>    NaN    NaN
    8:  3 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
    9:  3 2016-08-13 2018-12-31 2017-06-01 2018-11-30   98.5  200.0