Tags: r, join, data.table, non-equi-join

Non-equi join with aggregation in data.table


I'd like to add columns to data table 1 that are operations on data table 2, joining by a variable and where dates from data table 2 are <= the dates from data table 1. I'm looking for a solution that isn't too computationally expensive (I have about 20k rows).

Data table 1 - I have a dataset of proposals, their owners, and their editDates:

proposal_df <- structure(list(proposal = c(41, 62, 169, 72), owner = c("Adam", 
"Adam", "Alan", "Alan"), totalAtEdit = c(-27, 1000, 151, 1137
), editDate = structure(c(1556014200, 1560762240, 1563966600, 
1540832280), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = "data.table", row.names = c(NA, 
-4L))

  proposal owner totalAtEdit            editDate
1       41  Adam         -27 2019-04-23 10:10:00
2       62  Adam        1000 2019-06-17 09:04:00
3      169  Alan         151 2019-07-24 11:10:00
4       72  Alan        1137 2018-10-29 16:58:00

Data table 2 - I have a log of proposals and the date at which they were won or lost (outcome == 1 or 0):

proposal_log <- structure(list(proposal = c(9, 48, 43, 39, 45, 73, 111, 179, 
115, 146), outcome = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0), owner = c("Adam", 
"Adam", "Adam", "Adam", "Adam", "Alan", "Alan", "Alan", "Alan", 
"Alan"), totalAtEdit = c(2, 2, 4, 566, 100, 1264, 5000, 75, 493, 
18), editDate = structure(c(1557487860, 1561368780, 1561393140, 
1546446240, 1549463520, 1546614180, 1547196960, 1579603560, 1566925200, 
1536751800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = "data.table", row.names = 
c(NA, 
-10L))

   proposal outcome owner totalAtEdit            editDate
1         9       0  Adam           2 2019-05-10 11:31:00
2        48       1  Adam           2 2019-06-24 09:33:00
3        43       1  Adam           4 2019-06-24 16:19:00
4        39       1  Adam         566 2019-01-02 16:24:00
5        45       0  Adam         100 2019-02-06 14:32:00
6        73       0  Alan        1264 2019-01-04 15:03:00
7       111       0  Alan        5000 2019-01-11 08:56:00
8       179       0  Alan          75 2020-01-21 10:46:00
9       115       0  Alan         493 2019-08-27 17:00:00
10      146       0  Alan          18 2018-09-12 11:30:00

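I want to add several columns to proposal_df that summarise proposal_log (the number of won and lost proposals, the mean totalAtEdit of the won proposals, and the share won), joining by owner and using only the log rows where proposal_log$editDate <= proposal_df$editDate.

The output would look like this: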

  proposal owner totalAtEdit            editDate countWon countLost wonValueMean    pctWon
1       41  Adam         -27 2019-04-23 10:10:00        1         1          566 0.5000000
2       62  Adam        1000 2019-06-17 09:04:00        1         2          566 0.3333333
3      169  Alan         151 2019-07-24 11:10:00        0         3          NaN 0.0000000
4       72  Alan        1137 2018-10-29 16:58:00        0         1          NaN 0.0000000

Thanks!


Solution

  • One option is a non-equi join with by=.EACHI, which evaluates the summary once for each row of proposal_df:

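    library(data.table)
    # Convert both inputs to data.table by reference
    setDT(proposal_df)
    setDT(proposal_log)
    proposal_df[, c("countWon","countLost","wonValueMean","pctWon") := 
        # For each row of proposal_df (.SD), match log rows with the same owner
        # and a log editDate on or before the proposal's editDate
        proposal_log[.SD, on=.(owner, editDate<=editDate), by=.EACHI, {
            cw <- sum(outcome==1L)    # number of won proposals among the matches
            # x.totalAtEdit is proposal_log's totalAtEdit (not proposal_df's)
            .(cw, sum(outcome==0L), mean(x.totalAtEdit[outcome==1L]), cw/.N)
        }][, (1L:2L) := NULL]         # drop the join columns (owner, editDate)
    ]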
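
To see what the join is doing, it may help to split it into two steps: first materialise the matched rows, then aggregate them per proposal. The sketch below is only an illustration. The tables df and log and their columns (id, value, date) are hypothetical stand-ins, not the question's data, and it assumes every row of df has at least one match.

```r
library(data.table)

# Hypothetical toy tables (not the question's data)
df <- data.table(
  id    = 1:2,
  owner = c("A", "B"),
  date  = as.Date(c("2020-03-01", "2020-03-01"))
)
log <- data.table(
  owner   = c("A", "A", "A", "B"),
  outcome = c(1, 0, 1, 0),
  value   = c(10, 20, 30, 40),
  date    = as.Date(c("2020-01-01", "2020-02-01", "2020-04-01", "2020-02-15"))
)

# Step 1: non-equi join, giving one row per (df row, matching log row).
# x.date is log's own date, kept here so the matches can be inspected.
matched <- log[df, on = .(owner, date <= date),
               .(id, owner, outcome, value, logDate = x.date)]

# Step 2: aggregate the matched log rows for each df row
stats <- matched[, .(
  countWon     = sum(outcome == 1),
  countLost    = sum(outcome == 0),
  wonValueMean = mean(value[outcome == 1]),  # NaN when nothing was won
  pctWon       = mean(outcome == 1)
), by = id]

print(stats)
```

This two-step form builds the full set of matched rows in memory, whereas by=.EACHI computes the summary group by group, so the by=.EACHI version is likely the better fit when the log is large.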