rdplyrdata.tablefuzzyjoin

interval join with extra key


I would like to do an interval join with an additional key. The simplest way in dplyr is quite slow

intervalDf <- tibble(id = rep(seq(1, 100000, 1), 10),
                     k1 = rep(seq(1, 1000, 1), 1000),
                     startTime = sample(seq(as.Date('1995/01/01'), as.Date('1999/06/01'), by="day"), 1000000, replace = TRUE),
                     endTime = startTime + sample.int(180, 1000000, replace = TRUE))
eventDf <- tibble(k1 = rep(seq(1, 1000, 1), 200),
                  points = sample.int(10, 200000, replace = TRUE),
                  date = sample(seq(as.Date('1995/01/01'), as.Date('2000/01/01'), by="day"), 200000, replace = TRUE))
testDf <- inner_join(intervalDf, eventDf, by = "k1") %>%
  filter(date >= startTime,
         date <= endTime) %>%
  count(id, startTime, endTime, wt = points, name = "points")

I was hoping to use interval_join in the fuzzyjoin package, but that function does not allow for an additional join key besides the interval https://rdrr.io/cran/fuzzyjoin/man/interval_join.html.

I'm guessing there's some slick way with rolling join in data.table or something, but I can't figure it out.


Solution

  • we should use a non-equi join here, with slight modification to the approach outlined in the comments:

    library(data.table)
    
    setDT(eventDf)
    setDT(intervalDf)
    
    eventDf[intervalDf, on=.(k1, date>=startTime, date<=endTime)][, sum(points), .(id, startTime=date, endTime=date.1)]