rdata.tablesubset

How do I subset a data table row in R to get the rows unique to itself


I know this may be a simple question but I cant seem to get it right.

I have two data tables data table old_dt and data table new_dt. Both data tables has two similar columns. My goal is to get the rows from new_dt that is not in old_dt.

Here is an example. Old_dt

v1 v2
1 a
2 b
3 c
4 d

Here is new_dt

v1 v2
3 c
4 d
5 e

What I want is to get just the 5 e row.

Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this

sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2!= new_dt$v2)

Only resulted in new_dt itself.

Using

sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2!= new_dt$v2)

Reulted in nothing.

Using

sub.cti <- new_dt[,.(!old_dt$v1, !old_dt$v2)]

Reulted in multiple rows of FALSEs

Can somebody help me?

Thank you in advance


Solution

  • We can do a join (data from @giraffehere's post)

    df2[!df1, on = "a"]
    #   a  b
    #1: 6 14
    #2: 7 15
    

    To get rows in 'df1' that are not in 'df2' based on the 'a' column

    df1[!df2, on = "a"]
    #   a  b
    #1: 4  9
    #2: 5 10
    

    In the OP's example we need to join on both columns

    new_dt[!old_dt, on = c("v1", "v2")]
    #   v1 v2
    #1:  5  e
    

    NOTE: Here I assumed the 'new_dt' and 'old_dt' as data.tables.


    Of course, dplyr is a good package. For dealing with this problem, a shorter anti_join can be used

    library(dplyr)
    anti_join(new_dt, old_dt)
    #     v1    v2
    #   (int) (chr)
    #1     5     e
    

    or the setdiff from dplyr can work on data.frame, data.table, tbl_df etc.

    setdiff(new_dt, old_dt)
    #   v1 v2
    #1:  5  e
    

    However, the question is tagged as data.table.