I read about the collapse package recently and tried to translate the following data.table code to collapse to see if it's faster in real-world examples. The code computes, for each (month, day), the share of total absolute arrival delay accounted for by the two carriers with the largest delay totals. Here's my data.table code:
library(data.table)
library(nycflights13)
data("flights")
flights_DT <- as.data.table(flights)
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
flights_DT[
  j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
  keyby = c(id_var, by)
][
  i = order(-agg_val_var),
  j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)),
  keyby = by
][
  j = .SD[2L],
  keyby = by
][
  order(-value_share)
]
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-11 by the reprex package (v1.0.0)
I managed to translate the first data.table call, but struggled later on. It would be great to see how collapse would be used to handle this use case.
I think it only makes sense to translate data.table code to collapse if:

- you've come up with an arcane data.table expression to do something complex and statistical that data.table is not good at (such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, grouped centering or linear / polynomial fitting; see the sketch after this list),
- you actually don't need the data.table object but would much rather work with vectors / matrices / data.frames / tibbles,
- you want to write a statistical program and would much prefer standard-evaluation programming over NS eval and data.table syntax, or
- collapse is indeed substantially faster for your specific application.
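For a flavor of the first point, here is a quick sketch of the kind of one-liners meant there; this example is not from the code above, it uses the built-in mtcars data and a small made-up panel:

library(collapse)
fmode(mtcars$mpg, g = mtcars$cyl)                  # mode by groups
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt)   # weighted aggregation
fwithin(mtcars$mpg, g = mtcars$cyl)                # grouped centering
# Lagging a panel: g = panel id, t = time variable
p <- data.frame(id = c(1, 1, 1, 2, 2), t = c(1, 2, 3, 1, 2), x = 1:5)
flag(p$x, n = 1, g = p$id, t = p$t)                # lag within each group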
Now to the specific code you have provided. It mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you 3 solutions ranging from full NS eval to full standard-eval, base-R-style programming.
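To make that distinction concrete before the solutions: most collapse verbs come in a non-standard evaluation flavor and a standard evaluation counterpart. A minimal sketch, again with mtcars:

library(collapse)
fselect(mtcars, mpg, cyl)           # NSE: bare column names
get_vars(mtcars, c("mpg", "cyl"))   # SE: character vector of names
ftransform(mtcars, mpg = abs(mpg))  # NSE transformation
ftransformv(mtcars, "mpg", abs)     # SE: function applied to named columns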
library(data.table)
library(nycflights13)
library(magrittr)
library(collapse)
data("flights")
flights_DT <- as.data.table(flights)
# Defining a function for the second aggregation:
# myFUN(x) = (x[1] + x[2]) / sum(x), i.e. with x sorted in descending order,
# the cumulative value share of the two largest groups (what .SD[2L] picks above)
myFUN <- function(x) (cumsum(x[1:2])/sum(x))[2L]

# Solution 1: Non-Standard evaluation
flights_DT %>%
  fgroup_by(carrier, month, day) %>%
  fsummarise(agg_val_var = fsum(abs(arr_delay))) %>%
  roworder(month, day, -agg_val_var, na.last = NA) %>%
  fgroup_by(month, day) %>%
  fsummarise(value_share = myFUN(agg_val_var)) %>%
  roworder(-value_share)
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note the use of na.last = NA, which actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE).
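A quick demonstration of the difference:

fsum(NA)               # NA: an all-missing vector has no sum in collapse
sum(NA, na.rm = TRUE)  # 0: base R returns the empty sum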
Now the hybrid example, which is probably closest to the code you provided:
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
# Solution 2: Hybrid approach with standard eval and magrittr pipes
flights_DT %>%
  get_vars(c(id_var, val_var, by)) %>%
  ftransformv(val_var, abs) %>%
  collapv(c(id_var, by), fsum) %>%
  get_vars(c(by, val_var)) %>%
  roworderv(decreasing = c(FALSE, FALSE, TRUE), na.last = NA) %>%
  collapv(by, myFUN) %>%
  roworderv(val_var, decreasing = TRUE) %>%
  frename(replace, names(.) == val_var, "value_share")
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard eval in the same function in collapse. Finally, a big advantage of collapse is that you can use it for pretty low-level programming:
# Solution 3: Programming
# Select the needed columns and take absolute values of the value variable
data <- get_vars(flights_DT, c(id_var, val_var, by))
data[[val_var]] <- abs(.subset2(data, val_var))
# Create a grouping object and aggregate by carrier, month and day
g <- GRP(data, c(id_var, by))
data <- add_vars(get_vars(g$groups, by),
                 fsum(get_vars(data, val_var), g, use.g.names = FALSE))
# Sort by month and day, with values descending; na.last = NA drops missing aggregates
data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
# Regroup by (month, day) and apply myFUN to the sorted values
g <- GRP(data, by)
data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
data <- roworderv(data, "value_share", decreasing = TRUE)
data
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
I refer you to the blog post on programming with collapse for a more interesting example of how this can benefit the development of statistical code. Now for the evaluation, I wrapped these solutions in functions, where DT() is the data.table code you provided, run with 2 threads on a Windows machine.
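The wrapper bodies are not shown here; a minimal sketch of how they could look, simply wrapping the code from above (with data.table set to 2 threads via setDTthreads(2)):

DT <- function() flights_DT[
  j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
  keyby = c(id_var, by)
][
  i = order(-agg_val_var),
  j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)),
  keyby = by
][
  j = .SD[2L],
  keyby = by
][
  order(-value_share)
]

clp_NSE <- function() flights_DT %>%
  fgroup_by(carrier, month, day) %>%
  fsummarise(agg_val_var = fsum(abs(arr_delay))) %>%
  roworder(month, day, -agg_val_var, na.last = NA) %>%
  fgroup_by(month, day) %>%
  fsummarise(value_share = myFUN(agg_val_var)) %>%
  roworder(-value_share)

# clp_Hybrid() and clp_Prog() wrap Solutions 2 and 3 the same way

This checks equality: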
all_obj_equal(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> TRUE
Now the benchmark:
library(microbenchmark)
microbenchmark(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq       max neval cld
#>          DT() 85.81079 87.80887 91.82032 89.47025 92.54601 132.26073   100   b
#>     clp_NSE() 13.47535 14.15744 15.99264 14.80606 16.29140  28.16895   100  a
#>  clp_Hybrid() 13.79843 14.23508 16.61606 15.00196 16.83604  32.94648   100  a
#>    clp_Prog() 13.71320 14.17283 16.16281 14.94395 16.16935  39.24706   100  a