[SOLVED] How do I subset a data table row in R to get the rows unique to itself

How do I subset a data table row in R to get the rows unique to itself

I know this may be a simple question but I cant seem to get it right.

I have two data tables data table old_dt and data table new_dt. Both data tables has two similar columns. My goal is to get the rows from new_dt that is not in old_dt.

Here is an example. Old_dt

v1 v2
1 a
2 b
3 c
4 d

Here is new_dt

v1 v2
3 c
4 d
5 e

What I want is to get just the 5 e row.

Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this

sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2!= new_dt$v2)

Only resulted in new_dt itself.

Using

sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2!= new_dt$v2)

Reulted in nothing.

Using

sub.cti <- new_dt[,.(!old_dt$v1, !old_dt$v2)]

Reulted in multiple rows of FALSEs

Can somebody help me?

Thank you in advance

Solution

We can do a join (data from @giraffehere's post)

df2[!df1, on = "a"]
#   a  b
#1: 6 14
#2: 7 15

To get rows in 'df1' that are not in 'df2' based on the 'a' column

df1[!df2, on = "a"]
#   a  b
#1: 4  9
#2: 5 10

In the OP's example we need to join on both columns

new_dt[!old_dt, on = c("v1", "v2")]
#   v1 v2
#1:  5  e

NOTE: Here I assumed the 'new_dt' and 'old_dt' as data.tables.

Of course, dplyr is a good package. For dealing with this problem, a shorter anti_join can be used

library(dplyr)
anti_join(new_dt, old_dt)
#     v1    v2
#   (int) (chr)
#1     5     e

or the setdiff from dplyr can work on data.frame, data.table, tbl_df etc.

setdiff(new_dt, old_dt)
#   v1 v2
#1:  5  e

However, the question is tagged as data.table.