r, data.table, non-equi-join

Non-equi join of dates using data table


I have a data table of edits:

library(data.table)

edits <- data.table(proposal=c('A','A','A'),
           editField=c('probability','probability','probability'),
           startDate=as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2018-10-10 15:47:00')),
           endDate=as.POSIXct(c('2019-09-06 12:12:00','2018-10-10 15:47:00','9999-12-31 05:00:00')),
           value=c(.1,.3,.1))

   proposal   editField           startDate             endDate value
1:        A probability 2017-04-14 00:00:00 2019-09-06 12:12:00   0.1
2:        A probability 2019-09-06 12:12:00 2018-10-10 15:47:00   0.3
3:        A probability 2018-10-10 15:47:00 9999-12-31 05:00:00   0.1

I would like to join it to a data table of events:

events <-     data.table(proposal='A',
                  editDate=as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2018-07-04 15:33:59','2018-07-27 08:01:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-11-26 11:10:00','2019-02-05 13:06:59')),
                  editField=c('Created','stage','probability','estOrder','estOrder','estOrder','stage','probability','estOrder','estOrder','estOrder'))

    proposal            editDate   editField
 1:        A 2017-04-14 00:00:00     Created
 2:        A 2019-09-06 12:12:00       stage
 3:        A 2019-09-06 12:12:00 probability
 4:        A 2019-09-06 12:12:00    estOrder
 5:        A 2018-07-04 15:33:59    estOrder
 6:        A 2018-07-27 08:01:00    estOrder
 7:        A 2018-10-10 15:47:00       stage
 8:        A 2018-10-10 15:47:00 probability
 9:        A 2018-10-10 15:47:00    estOrder
10:        A 2018-11-26 11:10:00    estOrder
11:        A 2019-02-05 13:06:59    estOrder

The output should look like this, where value is the probability in effect at the time the edit took place:

desired.join <- cbind(events, value=c(.1,.3,.3,.3,.3,.3,.3,.1,.1,.1,.1))
    proposal            editDate   editField value
 1:        A 2017-04-14 00:00:00     Created   0.1
 2:        A 2019-09-06 12:12:00       stage   0.3
 3:        A 2019-09-06 12:12:00 probability   0.3
 4:        A 2019-09-06 12:12:00    estOrder   0.3
 5:        A 2018-07-04 15:33:59    estOrder   0.3
 6:        A 2018-07-27 08:01:00    estOrder   0.3
 7:        A 2018-10-10 15:47:00       stage   0.3
 8:        A 2018-10-10 15:47:00 probability   0.1
 9:        A 2018-10-10 15:47:00    estOrder   0.1
10:        A 2018-11-26 11:10:00    estOrder   0.1
11:        A 2019-02-05 13:06:59    estOrder   0.1

This is what I have so far to try to join the two:

edits[editField=='probability'][events, on=.(proposal, startDate<=editDate, endDate>editDate)]

However, when I attempt this, I get the following error message:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 16 rows; more than 14 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.


Solution

  • It looks like you are trying to join edits and events so that a probability value from the edits data table is associated with the correct observation from the events data table.

    The error seems to occur because the time intervals in the edits data table are not mutually exclusive: the second row ends before it starts, and the first and third rows overlap, so some events match more than one interval. When I modify the time intervals to what I think you intended, your code runs without the error and returns one row per event.

    library(data.table)
    
    edits <- data.table(proposal=c('A','A','A'),
        editField=c('probability','probability','probability'),
        startDate=as.POSIXct(c('2017-04-14 00:00:00','2018-10-10 15:47:00','2019-09-06 12:12:00')),
        endDate=as.POSIXct(c('2018-10-10 15:47:00','2019-09-06 12:12:00','9999-12-31 05:00:00')),
        value=c(.1,.3,.1))
    
    events <- data.table(proposal='A',
        editDate=as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2018-07-04 15:33:59','2018-07-27 08:01:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-11-26 11:10:00','2019-02-05 13:06:59')),
        editField=c('Created','stage','probability','estOrder','estOrder','estOrder','stage','probability','estOrder','estOrder','estOrder'))
    
    edits[editField=='probability'][events, on=.(proposal, startDate<=editDate, endDate>editDate)]
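
    To see why the intervals as originally posted trigger the error, here is a small sketch that checks the question's edits table for intervals that run backwards or overlap. The names edits_orig, inverted, prevEnd and overlapping are my own, introduced only for illustration.

    ```r
    library(data.table)

    # The edits table exactly as posted in the question
    edits_orig <- data.table(
        proposal  = 'A',
        editField = 'probability',
        startDate = as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2018-10-10 15:47:00')),
        endDate   = as.POSIXct(c('2019-09-06 12:12:00','2018-10-10 15:47:00','9999-12-31 05:00:00')),
        value     = c(.1,.3,.1))

    # Rows whose interval runs backwards (start later than end)
    inverted <- edits_orig[startDate > endDate]

    # Sort by start, then flag rows that begin before the previous row has ended
    setorder(edits_orig, proposal, startDate)
    edits_orig[, prevEnd := shift(endDate), by = proposal]
    overlapping <- edits_orig[!is.na(prevEnd) & startDate < prevEnd]

    print(inverted)
    print(overlapping)
    ```

    With the posted data, the five events from 2018-10-10 15:47:00 up to (but not including) 2019-09-06 12:12:00 fall inside both the first and the third interval, so the join returns 11 + 5 = 16 rows, which is the count in the error message.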
    

    Since every row of edits already has editField 'probability', you can also do the join without chaining the filter:

      edits[events, on=.(proposal, startDate<=editDate, endDate>editDate)]
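
    If all you need is the events table with a value column added (the shape of desired.join in the question), an update join is one option. This is only a sketch, assuming the corrected edits table from this answer. The roles are swapped here: events is the table being updated, so the join conditions are written from its side.

    ```r
    library(data.table)

    # Corrected, non-overlapping intervals (as in this answer)
    edits <- data.table(proposal=c('A','A','A'),
        editField=c('probability','probability','probability'),
        startDate=as.POSIXct(c('2017-04-14 00:00:00','2018-10-10 15:47:00','2019-09-06 12:12:00')),
        endDate=as.POSIXct(c('2018-10-10 15:47:00','2019-09-06 12:12:00','9999-12-31 05:00:00')),
        value=c(.1,.3,.1))

    events <- data.table(proposal='A',
        editDate=as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2018-07-04 15:33:59','2018-07-27 08:01:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-11-26 11:10:00','2019-02-05 13:06:59')),
        editField=c('Created','stage','probability','estOrder','estOrder','estOrder','stage','probability','estOrder','estOrder','estOrder'))

    # Update join: for each interval in edits, find the events with
    # startDate <= editDate < endDate and copy that interval's value onto them.
    # i.value refers to the value column of edits (the table inside the brackets).
    events[edits,
           on = .(proposal, editDate >= startDate, editDate < endDate),
           value := i.value]

    print(events)
    ```

    This adds value to events by reference and leaves its rows and other columns as they were. An event that falls in no interval would be left as NA.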
    

    Alternatively, as Jonny Phelps suggested, you could use foverlaps, but this also requires mutually exclusive time intervals in the edits data table:

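    # Give each event a zero-width interval: startDate == editDate
    events[, startDate := editDate]
    
    # The last two key columns of each table are treated as the interval
    setkey(events, startDate, editDate)
    setkey(edits, startDate, endDate)
    
    # For each event, return the first edits interval that overlaps it
    foverlaps(events, edits, type="any", mult="first")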
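
    One thing to watch with foverlaps: as I understand its documentation, it treats intervals as closed on both ends, so an event that lands exactly on a boundary overlaps two adjacent intervals, and mult decides which one is returned. The sketch below compares mult="first" and mult="last" with the non-equi join on the corrected data. The id and editEnd columns are hypothetical helpers I added for illustration, and proposal is included in the key as an assumption about what you would want with more than one proposal.

    ```r
    library(data.table)

    edits <- data.table(proposal=c('A','A','A'),
        editField=c('probability','probability','probability'),
        startDate=as.POSIXct(c('2017-04-14 00:00:00','2018-10-10 15:47:00','2019-09-06 12:12:00')),
        endDate=as.POSIXct(c('2018-10-10 15:47:00','2019-09-06 12:12:00','9999-12-31 05:00:00')),
        value=c(.1,.3,.1))

    events <- data.table(proposal='A',
        editDate=as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2018-07-04 15:33:59','2018-07-27 08:01:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-11-26 11:10:00','2019-02-05 13:06:59')),
        editField=c('Created','stage','probability','estOrder','estOrder','estOrder','stage','probability','estOrder','estOrder','estOrder'))

    events[, id := .I]               # remember the original row order
    events[, editEnd := editDate]    # zero-width interval for each event

    # foverlaps requires the second table to be keyed, interval columns last
    setkey(edits, proposal, startDate, endDate)

    first_match <- foverlaps(events, edits,
        by.x = c("proposal", "editDate", "editEnd"),
        type = "any", mult = "first")[order(id)]

    last_match <- foverlaps(events, edits,
        by.x = c("proposal", "editDate", "editEnd"),
        type = "any", mult = "last")[order(id)]

    # Half-open intervals: startDate <= editDate < endDate
    non_equi <- edits[events, on = .(proposal, startDate <= editDate, endDate > editDate)]

    # Side-by-side comparison of the value each approach assigns
    print(data.table(editDate   = events$editDate,
                     first      = first_match$value,
                     last       = last_match$value,
                     non_equi   = non_equi$value))
    ```

    On this data the three approaches differ only for events whose editDate equals an interval boundary. If you want foverlaps to agree with the startDate<=editDate, endDate>editDate join, mult="last" looks like the closer match.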