When I want to join two data frames on two intervals, I prefer to use the fuzzyjoin package because, in my opinion, it is easy to read. But when I need to work with large datasets, the fuzzyjoin package is a no-go because it is very slow. Luckily, the data.table package can do the same thing many times faster, and with some practice it also becomes easier to use. Here is a reproducible example with a benchmark:
library(fuzzyjoin)
library(data.table)
library(microbenchmark)
microbenchmark(
  "fuzzyjoin" = {
    fuzzy_left_join(df, df_dates, by = c('group', 'start' = 'from', 'end' = 'to'),
                    match_fun = list(`==`, `<=`, `>=`))
  },
  "data.table" = {
    setDT(df)
    setDT(df_dates)
    df_dates[df, .(group, start, end, from, to, value),
             on = .(group, from >= start, to <= end)]
  }
)
#> Unit: milliseconds
#>        expr       min         lq       mean     median         uq       max neval cld
#>   fuzzyjoin 149.17845 155.207436 159.053306 157.323145 159.930656 258.57719   100   b
#>  data.table   1.44272   1.618015   1.999427   1.704792   1.934678  20.37736   100   a
Created on 2022-12-15 with reprex v2.0.2
As you can see, for this simple computation the data.table option is already more than 100x faster. So I was wondering: why is the fuzzyjoin package so much slower than the data.table option? And is the data.table option always faster than fuzzyjoin?
df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), start = structure(c(19327, 19330, 19333,
19336, 19339, 19342, 19327, 19330, 19333, 19336, 19339, 19342
), class = "Date"), end = structure(c(19330, 19333, 19336, 19339,
19342, 19345, 19330, 19333, 19336, 19339, 19342, 19345), class = "Date")), class = "data.frame", row.names = c(NA,
-12L))
df_dates <- structure(list(group = c("A", "A", "B", "B"), from = structure(c(19328,
19340, 19332, 19339), class = "Date"), to = structure(c(19329,
19341, 19333, 19340), class = "Date"), value = c(1, 3, 2, 4)), class = "data.frame", row.names = c(NA,
-4L))
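To illustrate the scaling difference, here is a rough sketch of the cross-product-and-filter pattern that an interval join reduces to when there is no index (this is an assumption about the general approach, not fuzzyjoin's exact internals): every within-group row pair is materialized and then filtered with the match conditions, so the work grows with `nrow(df) * nrow(df_dates)`, while data.table's non-equi join can instead binary-search sorted keys. Using the `df` and `df_dates` objects defined above:

```r
# Materialize all within-group row pairs, then filter with the match
# conditions -- the cost is proportional to the number of candidate pairs.
cross <- merge(df, df_dates, by = "group")  # all within-group row pairs
res <- cross[cross$start <= cross$from & cross$end >= cross$to, ]
nrow(res)  # 4 interval pairs match
```

This is only the inner-join part; `fuzzy_left_join` additionally keeps the unmatched rows of `df`, but the pairwise comparison cost is the same.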
Looks like the development version of dplyr (1.0.99.9000) is, in this case and on my computer, faster than data.table, and with join_by it offers improved syntax over fuzzyjoin.
# added to the microbenchmark code in the OP (requires library(dplyr))
"dplyr 1.0.99.9000" = {
  left_join(df, df_dates, join_by(group, start <= from, end >= to))
}
Unit: milliseconds
               expr      min        lq       mean    median        uq      max neval
          fuzzyjoin 131.2089 141.23245 155.888496 147.76050 160.55830 372.0843   100
  dplyr 1.0.99.9000   1.7140   1.84115   2.189630   1.94575   2.09340  13.0027   100
 data.table 1.14.6    2.0335   2.23480   2.868417   2.32665   2.58865  16.7603   100
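For what it's worth, `join_by()` has since shipped in the released dplyr 1.1.0 on CRAN, so the development build is no longer needed. With the `df` and `df_dates` from the question, the same join is:

```r
# join_by() is available in released dplyr >= 1.1.0; no dev build required.
library(dplyr)

res <- left_join(df, df_dates, by = join_by(group, start <= from, end >= to))
# All 12 rows of df are kept; only the 4 rows whose interval contains a
# df_dates interval get a non-NA value.
```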