rjoindplyrdbplyrdtplyr

Non-equi join in tidyverse


I was wondering whether someone knows if the dplyr extension packages (dbplyr and dtplyr) allow non-equi joins within the usual dplyr workflow? I rarely need data.table, but fast non-equi joins are the only moments where I always need to setDT, then perform the join and then switch back with as_tibble(). I scrolled through the issues in the packages on github but didn't find whether this is something that's planned or already implemented.


Solution

  • Non-equi joins are available since 1.1.0 via the function join_by. To create non-equi joins, you can use <, >, >=, <=, or helpers between, within, overlaps and nearest.

    library(dplyr)
    
    #Example from https://github.com/tidyverse/dplyr/pull/5910.
    
    set.seed(123)
    dates <- as.Date("2019-01-01") + 0:4
    needles <- tibble(dates = dates, x = sample(length(dates)))
    
    set.seed(123)
    lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
    upper <- lower + sample(2, 5, replace = TRUE)
    haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))
    
    needles
    #> # A tibble: 5 x 2
    #>   dates          x
    #>   <date>     <int>
    #> 1 2019-01-01     3
    #> 2 2019-01-02     2
    #> 3 2019-01-03     5
    #> 4 2019-01-04     4
    #> 5 2019-01-05     1
    haystack
    #> # A tibble: 5 x 3
    #>   lower      upper          y
    #>   <date>     <date>     <int>
    #> 1 2019-01-04 2019-01-06     1
    #> 2 2019-01-07 2019-01-08     2
    #> 3 2019-01-04 2019-01-05     3
    #> 4 2019-01-03 2019-01-05     4
    #> 5 2019-01-03 2019-01-05     5
    
    # Non-equi join
    # For each row in `needles`, find locations in `haystack` matching the condition
    left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
    #> # A tibble: 12 x 5
    #>    dates          x lower      upper          y
    #>    <date>     <int> <date>     <date>     <int>
    #>  1 2019-01-01     3 NA         NA            NA
    #>  2 2019-01-02     2 NA         NA            NA
    #>  3 2019-01-03     5 2019-01-03 2019-01-05     4
    #>  4 2019-01-03     5 2019-01-03 2019-01-05     5
    #>  5 2019-01-04     4 2019-01-04 2019-01-06     1
    #>  6 2019-01-04     4 2019-01-04 2019-01-05     3
    #>  7 2019-01-04     4 2019-01-03 2019-01-05     4
    #>  8 2019-01-04     4 2019-01-03 2019-01-05     5
    #>  9 2019-01-05     1 2019-01-04 2019-01-06     1
    #> 10 2019-01-05     1 2019-01-04 2019-01-05     3
    #> 11 2019-01-05     1 2019-01-03 2019-01-05     4
    #> 12 2019-01-05     1 2019-01-03 2019-01-05     5