I have an aggregation function that sums groups of data then creates a flag based on a set of conditions and assigns that to the group. The issue is that there is a large number of groups to aggregate and each group is very small.
This means that the time required to perform an aggregate, even for a modest-sized dataset, is prohibitively long and this needs to work on datasets with millions of rows.
I've created a reproducible example below which illustrates the problem and has a similar style of logic to my aggregation function (mine has more conditions, but they're similar in nature):
#install.packages("palmerpenguins")
library(data.table)
library(palmerpenguins)
# Create data
GROUPS <- 1:100
penguin_list <- lapply(GROUPS, \(x) data.table(group = x, penguins))
penguin_table <- rbindlist(penguin_list)
# Aggregation function
aggregatePenguinMass <- function(mass, sex, ratio = 2/3){
data <- data.table(mass, sex)
total <- sum(data[,mass], na.rm = TRUE)
n_sex <- data[,.N, by = sex]
n_male <- n_sex[sex == "male", N]
n_female <- n_sex[sex == "female", N]
if(n_female >= ratio * (n_male + n_female)){
return(data.table(total = total,
flag = "F"))
} else {
return(data.table(total = total,
flag = "M"))
}
}
# Perform aggregation and time
system.time(
penguin_table[, aggregatePenguinMass(body_mass_g, sex), by = .(group, species, year)]
)
# user system elapsed
# 2.66 0.47 7.30
How can I change this function or the way I'm performing the aggregation to make it an order of magnitude faster?
Unit: milliseconds
expr min lq mean median uq max neval
base() 1793.4883 1803.1108 2491.32386 1821.8588 1928.4540 8195.9058 10
RBarradas1() 1554.5428 1577.5579 1614.60087 1583.7645 1637.0681 1764.4564 10
RBarradas2() 119.5481 127.6831 186.11583 131.7930 141.9157 657.2762 10
NGraham() 23.2437 23.8428 25.50738 24.5770 27.4004 30.6569 10
Miff() 427.2106 440.3240 462.32146 455.1878 462.6939 577.3803 10
I think you want to avoid building as many data.tables as you have groups, and work with one large data.table only. I can reproduce your aggregation with dtyplr (as I'm not so practiced in data.table syntax). that looks like
library(dtplyr)
agg_dtyplr <- function(dt,mass, ratio = 2 / 3) {
lazy_dt(dt) |> group_by(group, species, year) |>
summarise(
total = sum({{mass}}, na.rm = TRUE),
n_female = sum(1 * (sex == "female"),na.rm=TRUE),
n = n())|>
mutate(
flag = if_else(n_female >= ratio * n,
"F", "M"
)
) |>
select(group,species,year,total,flag) |>
as.data.table()
}
agg_dtyplr(penguin_table,body_mass_g)