I want to perform several operations intertwining dtplyr and data.table code. My question is whether, having loaded dtplyr, I can apply dplyr verbs to a data.table object and get optimized data.table code as I would with a lazy_dt.
Here are some examples. In each case, would dtplyr translate the pipeline to data.table code, or is plain dplyr doing the work?
# Setup for all chunks:
library(dplyr)
library(data.table)
library(dtplyr)
a) setDT
dataframe # class data.frame
setDT(dataframe)
dataframe %>%
  group_by(id) %>%
  mutate(rows_per_group = n())
b) data.table object
dt <- as.data.table(dataframe) # or dt <- data.table::fread(filepath)
dt %>%
  group_by(id) %>%
  mutate(rows_per_group = n())
Also, if all of them make dtplyr work, which is the most efficient option among a), b), and c) using lazy_dt(dataframe)?
I was wondering about a similar question, and after reading this post I ran some benchmarks. I varied two things: the method (data.table, dplyr, or dtplyr) and the class of the input object (tibble or data.table).
The results are:
The results do not confirm the claim that "If you have a data.table, using it with any dplyr generic will automatically convert it to a lazy_dt object": applying a dplyr function directly to the data.table object is much slower than going through dtplyr::lazy_dt(). Further, as you can see, dtplyr::lazy_dt() is faster when you give it a data.table object than when you give it a tibble. But it makes no sense to convert a tibble to a data.table before calling dtplyr::lazy_dt(), because the conversion plus dtplyr::lazy_dt() takes about as long as calling dtplyr::lazy_dt() on the tibble directly (compare the dtplyr and dtplyr_trans results, where dtplyr_trans_fun() first calls as.data.table(data) to convert the given object to a data.table).
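Apart from timing, one way to see whether a pipeline actually went through dtplyr is to look at the class of the result and, for a lazy result, at the generated data.table call. The following is only a minimal sketch: the table dt and its columns are hypothetical, and what the direct call returns may depend on the installed dtplyr version, so the code prints the class rather than assuming it.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Small hypothetical table, for illustration only.
dt <- data.table(id = c(1, 1, 2), v = 1:3)

# Explicit lazy_dt(): the verbs are recorded and translated, not run yet.
lazy_res <- dt %>%
  lazy_dt() %>%
  group_by(id) %>%
  mutate(rows_per_group = n())

# A pipeline built by dtplyr inherits from "dtplyr_step".
is_lazy <- inherits(lazy_res, "dtplyr_step")
print(is_lazy)

# Show the data.table code that dtplyr generated for the pipeline.
show_query(lazy_res)

# The same verbs applied directly to the data.table. Whether this is a
# dtplyr_step or an ordinary grouped data frame is what the class reveals.
direct_res <- dt %>%
  group_by(id) %>%
  mutate(rows_per_group = n())
print(class(direct_res))

# Force evaluation of the lazy pipeline.
lazy_out <- as.data.table(lazy_res)
```

If class(direct_res) does not include "dtplyr_step", the direct call was handled by plain dplyr, which would be consistent with the slower timings reported here.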
The code I used is
# Data generated as in linked blog post
library(data.table)
library(dplyr)
library(dtplyr)
library(microbenchmark)
library(ggplot2)
N <- 1e7
K <- 100
set.seed(1)
dttbl <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE), # large groups (char)
  id5 = sample(N / K, N, TRUE), # small groups (int)
  v1 = sample(5, N, TRUE), # int in range [1,5]
  v2 = sample(5, N, TRUE), # int in range [1,5]
  v3 = sample(round(runif(100, max = 100), 4), N, TRUE) # numeric, e.g. 23.5749
)
tbbl <- as_tibble(dttbl)
# data.table method.
dt_fun <- function(data) {
  data[, lapply(.SD, sum), keyby = id5, .SDcols = 3:5]
}
# dtplyr method with lazy_dt.
dtplyr_fun <- function(data) {
  data %>%
    lazy_dt() %>%
    group_by(id5) %>%
    summarise_at(vars(v1:v3), sum) %>%
    as_tibble()
}
# dtplyr method with lazy_dt where the provided data is transformed
# to data.table first.
dtplyr_trans_fun <- function(data) {
  data %>%
    as.data.table() %>%
    lazy_dt() %>%
    group_by(id5) %>%
    summarise_at(vars(v1:v3), sum) %>%
    as_tibble()
}
# dplyr method.
dplyr_fun <- function(data) {
  data %>%
    group_by(id5) %>%
    summarise_at(vars(v1:v3), sum) %>%
    as_tibble()
}
# Benchmark every method on both input classes.
# The pure data.table method is only run on the data.table input.
results <- list(dttbl, tbbl) %>%
  lapply(., function(object_i) {
    if (is.data.table(object_i)) {
      microbenchmark(
        dt = dt_fun(data = object_i),
        dtplyr = dtplyr_fun(data = object_i),
        dtplyr_trans = dtplyr_trans_fun(data = object_i),
        dplyr = dplyr_fun(data = object_i),
        times = 20) %>%
        {data.frame(method = .$expr, time = .$time, class = class(object_i)[1])} %>%
        mutate(method = gsub("data = object_i", "", method))
    } else {
      microbenchmark(
        dtplyr = dtplyr_fun(data = object_i),
        dtplyr_trans = dtplyr_trans_fun(data = object_i),
        dplyr = dplyr_fun(data = object_i),
        times = 20) %>%
        {data.frame(method = .$expr, time = .$time, class = class(object_i)[1])} %>%
        mutate(method = gsub("data = object_i", "", method))
    }
  }) %>%
  do.call("rbind.data.frame", .)
# Boxplot of the raw timings per method, filled by input class.
results %>%
  mutate(method = factor(method, c("dt", "dtplyr", "dtplyr_trans", "dplyr")),
         class = factor(class, unique(class))) %>%
  ggplot(., aes(time, method, fill = class)) +
  geom_boxplot() +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme_bw()
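
Before reading too much into the timings, it may be worth confirming that the data.table and dtplyr variants compute the same result. The sketch below does that on a much smaller, hypothetical table (the name small and its size are assumptions for illustration, not the benchmark data), and it uses an explicit summarise() instead of summarise_at() to keep the example simple.

```r
library(data.table)
library(dplyr)
library(dtplyr)

set.seed(1)
n <- 1000

# Small hypothetical table with the same column roles as the benchmark data.
small <- data.table(
  id5 = sample(10, n, TRUE),
  v1  = sample(5, n, TRUE),
  v2  = sample(5, n, TRUE),
  v3  = runif(n)
)

# Pure data.table: sum v1..v3 within each id5 group.
dt_out <- small[, lapply(.SD, sum), keyby = id5, .SDcols = c("v1", "v2", "v3")]

# The same aggregation written with dplyr verbs on a lazy_dt.
dtplyr_out <- small %>%
  lazy_dt() %>%
  group_by(id5) %>%
  summarise(v1 = sum(v1), v2 = sum(v2), v3 = sum(v3)) %>%
  as.data.frame()

# Sort both by the grouping column so that row order cannot cause a mismatch.
dt_df <- as.data.frame(dt_out)
dt_df <- dt_df[order(dt_df$id5), ]
dtplyr_df <- dtplyr_out[order(dtplyr_out$id5), ]

# Compare values only, ignoring attributes such as row names or keys.
same <- isTRUE(all.equal(dt_df, dtplyr_df, check.attributes = FALSE))
print(same)
```

If the two outputs agree, any difference between dt and dtplyr in the benchmark reflects overhead rather than different work being done.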