rdata.tablemicrobenchmarkfuzzyjoin

Why is fuzzyjoin slower than data.table in R


When I want to join two data frames based on two intervals, I prefer to use the fuzzyjoin package because it is easy to read in my opinion. But when I need to work with large datasets, the fuzzyjoin package is a no-go because it is very slow. The data.table package luckily can do the same a lot of times faster and becomes after more practice easier to use. Here is a reproducible example with benchmark:

library(fuzzyjoin)
library(data.table)
library(microbenchmark)

microbenchmark(
  "fuzzyjoin" = {
    fuzzy_left_join(df, df_dates, by = c('group', 'start' = 'from', 'end' = 'to'),
                    match_fun = list(`==`, `<=`, `>=`))
  }, 
  "data.table" = {
    setDT(df)
    setDT(df_dates)
    df_dates[df, .(group, start, end, from, to, value), 
             on = .(group, from >= start, to <= end)]
  }
)
#> Unit: milliseconds
#>        expr       min         lq       mean     median         uq       max
#>   fuzzyjoin 149.17845 155.207436 159.053306 157.323145 159.930656 258.57719
#>  data.table   1.44272   1.618015   1.999427   1.704792   1.934678  20.37736
#>  neval cld
#>    100   b
#>    100  a

Created on 2022-12-15 with reprex v2.0.2

As you can see this simple computation it is already more than 100x times faster. So I was wondering why the fuzzyjoin package is so much slower than the data.table option? Is the data.table option always faster than fuzzyjoin?


df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B", 
                               "B", "B", "B", "B"), start = structure(c(19327, 19330, 19333, 
                                                                        19336, 19339, 19342, 19327, 19330, 19333, 19336, 19339, 19342
                               ), class = "Date"), end = structure(c(19330, 19333, 19336, 19339, 
                                                                     19342, 19345, 19330, 19333, 19336, 19339, 19342, 19345), class = "Date")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                    -12L))
df_dates <- structure(list(group = c("A", "A", "B", "B"), from = structure(c(19328, 
                                                                             19340, 19332, 19339), class = "Date"), to = structure(c(19329, 
                                                                                                                                     19341, 19333, 19340), class = "Date"), value = c(1, 3, 2, 4)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                        -4L))

Solution

  • Looks like the development version of dplyr (1.0.99.9000) is in this case, on my computer, faster than data.table, and it offers an improved syntax over fuzzyjoin with join_by.

    # added to microbenchmark code in OP
    "dplyr 1.0.99.9000" = {
        left_join(df, df_dates, join_by(group, start <= from, end >= to))
      }
    
    
    Unit: milliseconds
                  expr      min        lq       mean    median        uq      max neval
             fuzzyjoin 131.2089 141.23245 155.888496 147.76050 160.55830 372.0843   100
     dplyr 1.0.99.9000   1.7140   1.84115   2.189630   1.94575   2.09340  13.0027   100
     data.table 1.14.6   2.0335   2.23480   2.868417   2.32665   2.58865  16.7603   100