I know this may be a simple question but I cant seem to get it right.
I have two data tables data table old_dt and data table new_dt. Both data tables has two similar columns. My goal is to get the rows from new_dt that is not in old_dt.
Here is an example. Old_dt
v1 v2
1 a
2 b
3 c
4 d
Here is new_dt
v1 v2
3 c
4 d
5 e
What I want is to get just the 5 e row.
Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this
sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2!= new_dt$v2)
Only resulted in new_dt itself.
Using
sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2!= new_dt$v2)
Reulted in nothing.
Using
sub.cti <- new_dt[,.(!old_dt$v1, !old_dt$v2)]
Reulted in multiple rows of FALSEs
Can somebody help me?
Thank you in advance
We can do a join
(data from @giraffehere's post)
df2[!df1, on = "a"]
# a b
#1: 6 14
#2: 7 15
To get rows in 'df1' that are not in 'df2' based on the 'a' column
df1[!df2, on = "a"]
# a b
#1: 4 9
#2: 5 10
In the OP's example we need to join on
both columns
new_dt[!old_dt, on = c("v1", "v2")]
# v1 v2
#1: 5 e
NOTE: Here I assumed the 'new_dt' and 'old_dt' as data.table
s.
Of course, dplyr
is a good package. For dealing with this problem, a shorter anti_join
can be used
library(dplyr)
anti_join(new_dt, old_dt)
# v1 v2
# (int) (chr)
#1 5 e
or the setdiff
from dplyr
can work on data.frame
, data.table
, tbl_df
etc.
setdiff(new_dt, old_dt)
# v1 v2
#1: 5 e
However, the question is tagged as data.table
.