rdata.tablenon-equi-join

data.table non-equi join with min/max of joined value


I'm trying to do a non-equi join in data.table and extract the min/max of the joined values in that join.

set.seed(42)
dtA <- data.table(id=rep(c("A","B"),each=3), start=rep(1:3, times=2), end=rep(2:4, times=2))
dtB <- data.table(id=rep(c("A","B"),times=20), time=sort(runif(40, 1, 4)))

I'd like to preserve the min/max values of time when it is between start and end (and on id). Nominally this is just a non-equi join, but I can't find a combination of by=.EACHI or mult="..." that gets me what I want. Instead, the min/max typically are not aligned with the range I need. Unfortunately roll= does not support non-equi ranges.

dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
    on = .(id, start <= time, end > time), mult = "first"]
#        id start   end      Min      Max
#    <char> <int> <int>    <num>    <num>
# 1:      A     1     2 1.011845 3.966675
# 2:      A     2     3 1.011845 3.966675
# 3:      A     3     4 1.011845 3.966675
# 4:      B     1     2 1.011845 3.966675
# 5:      B     2     3 1.011845 3.966675
# 6:      B     3     4 1.011845 3.966675
dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
    on = .(id, start <= time, end > time), by = .EACHI]
#        id start   end      Min      Max
#    <char> <int> <int>    <num>    <num>
# 1:      A     1     2 1.858419 1.858419
# 2:      A     2     3 2.970977 2.970977
# 3:      A     3     4 3.934679 3.934679
# 4:      B     1     2 1.766286 1.766286
# 5:      B     2     3 2.925237 2.925237
# 6:      B     3     4 3.966675 3.966675

The second is the closest ("Max" is correct), but what I'd like to be able to get is:

       id start   end      Min      Max
   <char> <num> <int>    <num>    <num>
1:      A     1     2 1.011845 1.858419
2:      A     2     3 2.170610 2.970977
3:      A     3     4 3.115194 3.934679
4:      B     1     2 1.022002 1.766286
5:      B     2     3 2.164325 2.925237
6:      B     3     4 3.055509 3.966675

The real problem has around 400K or so rows with ranges joining in 2Mi rows of values, so I'd prefer to not do a full expansion of both frames to manually cut it back down to the size of dtA.

(I'm open to collapse suggestions.)


Solution

  • Switch the join around so it's B[A], and then assign inside A:

    dtA[,
        c("min","max") := dtB[
            dtA,
            on=.(id, time >= start, time < end), 
            .(min=min(x.time), max=max(x.time)),
            by=.EACHI][, c("min","max")]
        ]
    dtA
    
    #       id start   end      min      max
    #   <char> <int> <int>    <num>    <num>
    #1:      A     1     2 1.011845 1.858419
    #2:      A     2     3 2.170610 2.970977
    #3:      A     3     4 3.115194 3.934679
    #4:      B     1     2 1.022002 1.766286
    #5:      B     2     3 2.164325 2.925237
    #6:      B     3     4 3.055509 3.966675
    

    You can see it needs to be spun around, or otherwise the .EACHI group ends up being for each individual row in B instead of the matching rows in B inside the criteria for A:

    dtB[dtA, on=.(id, time >= start, time < end), .N, by=.EACHI]
    #       id  time  time     N
    #   <char> <int> <int> <int>
    #1:      A     1     2     5
    #2:      A     2     3     6
    #3:      A     3     4     9
    #4:      B     1     2     4
    #5:      B     2     3     6
    #6:      B     3     4    10
    
    dtA[dtB, on=.(id, start <= time, end > time), .N, by=.EACHI][, .(freq=.N), by=N]
    #       N  freq
    #   <int> <int>
    #1:     1    40
    

    This makes sense in the context of the help("data.table::special-symbols") description:

    Its usage is 'by=.EACHI' (or 'keyby=.EACHI') which invokes grouping-by-each-row-of-i

    In the DT[i, j, by] logic, dtA then needs to contribute the rows for the grouping.