rdataframedplyrdata.tablenon-equi-join

Join big dataframe in r and filter in the same time


df1 = data.frame(id=1,start=as.Date("2012-07-05"),end=as.Date("2012-07-15"))

df2 = data.frame(id=rep(1,1371),date = as.Date(as.Date("2012-05-06"):as.Date("2016-02-05")))


output = dplyr::inner_join(x=df1,y=df2,by="id") %>% filter(date>=start & date<= end)

I have two dataframes which have each one about one millions rows and I want to join them by id and then filter so that for each row, value of column date is comprised between value of startdate and enddate.

An dplyr::inner_join is not working because it asks too much memory. For each id the date interval is a lot bigger in df2 than in df1 so thats why an inner_join %>% filter is not efficient, is it possible to do it in the same time?


Solution

  • Non-equi join from data.table or sqldf packages could work a lot faster than dplyr so give them a try

    df1 = data.frame(id = 1, start = as.Date("2012-07-05"), 
                     end = as.Date("2012-07-15"))
    df1
    #>   id      start        end
    #> 1  1 2012-07-05 2012-07-15
    
    df2 = data.frame(id = rep(1, 1371), 
                     date = seq(as.Date("2012-05-06"), as.Date("2016-02-05"), by = "1 day"))
    head(df2)
    #>      id       date
    #> 1     1 2012-05-06
    #> 2     1 2012-05-07
    #> 3     1 2012-05-08
    #> 4     1 2012-05-09
    #> 5     1 2012-05-10
    #> 6     1 2012-05-11
    

    Using sqldf package:

    library(sqldf)
    
    sqldf("SELECT f1.id, start, end, date
          FROM df1 f1, df2 f2
          WHERE f1.id = f2.id AND
          f2.date >= f1.start AND 
          f2.date <= f1.end")
    
    #>    id      start        end       date
    #> 1   1 2012-07-05 2012-07-15 2012-07-05
    #> 2   1 2012-07-05 2012-07-15 2012-07-06
    #> 3   1 2012-07-05 2012-07-15 2012-07-07
    #> 4   1 2012-07-05 2012-07-15 2012-07-08
    #> 5   1 2012-07-05 2012-07-15 2012-07-09
    #> 6   1 2012-07-05 2012-07-15 2012-07-10
    #> 7   1 2012-07-05 2012-07-15 2012-07-11
    #> 8   1 2012-07-05 2012-07-15 2012-07-12
    #> 9   1 2012-07-05 2012-07-15 2012-07-13
    #> 10  1 2012-07-05 2012-07-15 2012-07-14
    #> 11  1 2012-07-05 2012-07-15 2012-07-15
    

    Using non-equi join from data.table package: Benchmark | Video

    library(data.table)
    
    ## convert both data.frames to data.tables by reference
    setDT(df1)
    setDT(df2) 
    
    # join by id and date within start & end limits
    # "x." is used so we can refer to the column in df1 explicitly
    df2[df1, .(id, date = x.date, start, end), 
      on = .(id, date >= start, date <= end)]
    
    #>     id       date      start        end
    #>  1:  1 2012-07-05 2012-07-05 2012-07-15
    #>  2:  1 2012-07-06 2012-07-05 2012-07-15
    #>  3:  1 2012-07-07 2012-07-05 2012-07-15
    #>  4:  1 2012-07-08 2012-07-05 2012-07-15
    #>  5:  1 2012-07-09 2012-07-05 2012-07-15
    #>  6:  1 2012-07-10 2012-07-05 2012-07-15
    #>  7:  1 2012-07-11 2012-07-05 2012-07-15
    #>  8:  1 2012-07-12 2012-07-05 2012-07-15
    #>  9:  1 2012-07-13 2012-07-05 2012-07-15
    #> 10:  1 2012-07-14 2012-07-05 2012-07-15
    #> 11:  1 2012-07-15 2012-07-05 2012-07-15
    

    Created on 2018-03-28 by the reprex package (v0.2.0).