rmergedata.tabledata-manipulation

Wrong variable comparison result when performing data.table merge of two table with duplicated keys


A collegue trying to do analysis came up with a code from chatgpt, doing something wrong, but that I don't understand.

Here is the example:

Let's consider a first table ( drugs: Patient have an id, and start a drug at x):

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

        id     x
    <char> <int>
 1:      A    10
 2:      A     8
 3:      A     8
 4:      B     3
 5:      B     9

Let's consider a second (and main) table (hospital visits, same patients, several hospital stays between two dates y1 and y2) :

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

        id    y1    y2 eds_id
    <char> <num> <num>  <int>
 1:      A     2     6      1
 2:      A     4     8      2
 3:      B     2     6      3
 4:      B     4     8      4

Now I want, for each hospital stay, know if any drug was prescribed to the patient during the stay, aka x between y1 and y2, for any drug.

I would do non-equi merge:

df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]

Which work.

ChatGPT came up with:

df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]

Which produce wrong answers:

df2[xinbetween_true != xinbetween]

       id    y1    y2 eds_id xinbetween xinbetween_true
   <char> <num> <num>  <int>     <lgcl>          <lgcl>
1:      B     2     6      3      FALSE            TRUE
2:      C     4     8      6      FALSE            TRUE

For these two entries, the ChatGPT script says no, when it actually has some of the df1 entries respecting the condition:

df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]


       id    y1    y2 eds_id xinbetween xinbetween_true     x
   <char> <num> <num>  <int>     <lgcl>          <lgcl> <int>
1:      B     2     6      3      FALSE            TRUE     3
2:      B     2     6      3      FALSE            TRUE     9
3:      B     2     6      3      FALSE            TRUE     9
4:      C     4     8      6      FALSE            TRUE     3
5:      C     4     8      6      FALSE            TRUE     4
6:      C     4     8      6      FALSE            TRUE     3

So is here my question:

What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? It does not do a proper non-equi merge, but I don't get what it does.

And in what case can it be used?


Solution

  • It's important here that both data.tables have duplicated IDs. Thus, df2[df1, on = "id"] is a cartesian join:

    df1[, rn := as.character(.I)]
    
    df2[df1, on = "id", allow.cartesian = TRUE]
    #        id    y1    y2 eds_id     x     rn
    #    <char> <num> <num>  <int> <int> <char>
    # 1:      A     2     6      1    10      1
    # 2:      A     4     8      2    10      1
    # 3:      A     2     6      1     8      2
    # 4:      A     4     8      2     8      2
    # 5:      A     2     6      1     8      3
    # 6:      A     4     8      2     8      3
    # 7:      B     2     6      3     3      4
    # 8:      B     4     8      4     3      4
    # 9:      B     2     6      3     9      5
    #10:      B     4     8      4     9      5
    #11:      B     2     6      3     9      6
    #12:      B     4     8      4     9      6
    #13:      C     2     6      5     3      7
    #14:      C     4     8      6     3      7
    #15:      C     2     6      5     4      8
    #16:      C     4     8      6     4      8
    #17:      C     2     6      5     3      9
    #18:      C     4     8      6     3      9
    #19:      D     2     6      7    10     10
    #20:      D     4     8      8    10     10
    #21:      D     2     6      7     7     11
    #22:      D     4     8      8     7     11
    #23:      D     2     6      7     5     12
    #24:      D     4     8      8     5     12
    #25:      E     2     6      9    10     13
    #26:      E     4     8     10    10     13
    #27:      E     2     6      9     7     14
    #28:      E     4     8     10     7     14
    #29:      E     2     6      9     6     15
    #30:      E     4     8     10     6     15
    #        id    y1    y2 eds_id     x     rn
    

    It should be elucidating to store the row numbers from df1 that match/are used for the comparison:

    library(data.table)
    df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
    set.seed(125)
    df1[,x := sample(1:10,.N,replace = T)]
    
    df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
    # unique identifier
    df2[,eds_id := 1:.N]
    
    df1[, rn := as.character(.I)]
    df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x)]
    df2[df1,xinbetween := fifelse(x >= y1 & x <= y2, rn, paste0(rn, "-")), on = "id"]
    
    #        id    y1    y2 eds_id xinbetween_true xinbetween
    #    <char> <num> <num>  <int>          <char>     <char>
    # 1:      A     2     6      1            <NA>         3-
    # 2:      A     4     8      2               3          3
    # 3:      B     2     6      3               4         6-
    # 4:      B     4     8      4            <NA>         6-
    # 5:      C     2     6      5               9          9
    # 6:      C     4     8      6               8         9-
    # 7:      D     2     6      7              12         12
    # 8:      D     4     8      8              12         12
    # 9:      E     2     6      9              15         15
    #10:      E     4     8     10              15         15
    

    As you see, the ChatGPT code uses the last row from df1 with a matching ID.