r, performance, data.table, aggregate

How to optimise an aggregation function with conditions?


I have an aggregation function that sums groups of data, then creates a flag based on a set of conditions and assigns that flag to the group. The issue is that there are a large number of groups to aggregate and each group is very small.

This means that the time required to perform an aggregation, even for a modest-sized dataset, is prohibitively long, and this needs to work on datasets with millions of rows.

I've created a reproducible example below, which illustrates the problem and has a similar style of logic to my aggregation function (mine has more conditions, but they're similar in nature):

#install.packages("palmerpenguins")
library(data.table)
library(palmerpenguins)

# Create data
GROUPS <- 1:100
penguin_list <- lapply(GROUPS, \(x) data.table(group = x, penguins))
penguin_table <- rbindlist(penguin_list)

# Aggregation function
aggregatePenguinMass <- function(mass, sex, ratio = 2/3){
    data <- data.table(mass, sex)
    
    # Total mass for the group
    total <- sum(data[, mass], na.rm = TRUE)
    
    # Count rows by sex (rows with NA sex form their own group and are excluded below)
    n_sex <- data[, .N, by = sex]
    n_male <- n_sex[sex == "male", N]
    n_female <- n_sex[sex == "female", N]
    
    # Flag the group "F" if females make up at least `ratio` of the sexed rows
    if(n_female >= ratio * (n_male + n_female)){
        return(data.table(total = total,
                          flag = "F"))
    } else {
        return(data.table(total = total,
                          flag = "M"))
    }
}

# Perform aggregation and time
system.time(
penguin_table[, aggregatePenguinMass(body_mass_g, sex), by = .(group, species, year)]
)
#   user  system elapsed 
#   2.66    0.47    7.30 

How can I change this function or the way I'm performing the aggregation to make it an order of magnitude faster?


Benchmarks

Unit: milliseconds
         expr       min        lq       mean    median        uq       max neval
       base() 1793.4883 1803.1108 2491.32386 1821.8588 1928.4540 8195.9058    10
 RBarradas1() 1554.5428 1577.5579 1614.60087 1583.7645 1637.0681 1764.4564    10
 RBarradas2()  119.5481  127.6831  186.11583  131.7930  141.9157  657.2762    10
    NGraham()   23.2437   23.8428   25.50738   24.5770   27.4004   30.6569    10
       Miff()  427.2106  440.3240  462.32146  455.1878  462.6939  577.3803    10
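
For reference, a table like this is usually produced with the microbenchmark package. A minimal sketch, assuming each approach has been wrapped in a zero-argument function named as in the expr column (those wrapper functions are not shown here):

#install.packages("microbenchmark")
library(microbenchmark)

# base() etc. are assumed wrappers around each candidate approach, e.g.
# base <- function() penguin_table[, aggregatePenguinMass(body_mass_g, sex),
#                                  by = .(group, species, year)]
microbenchmark(
    base(),
    RBarradas1(),
    RBarradas2(),
    NGraham(),
    Miff(),
    times = 10
)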

Solution

  • I think you want to avoid building as many data.tables as you have groups, and instead work with one large data.table only. I can reproduce your aggregation with dtplyr (as I'm not so practiced in data.table syntax). That looks like:

    library(dplyr)
    library(dtplyr)
    
    agg_dtplyr <- function(dt, mass, ratio = 2 / 3) {
      # One grouped pass over a single lazy data.table, no per-group objects
      lazy_dt(dt) |>
        group_by(group, species, year) |>
        summarise(
          total    = sum({{ mass }}, na.rm = TRUE),
          n_female = sum(sex == "female", na.rm = TRUE),
          n        = n()
        ) |>
        mutate(flag = if_else(n_female >= ratio * n, "F", "M")) |>
        select(group, species, year, total, flag) |>
        as.data.table()
    }
    
    agg_dtplyr(penguin_table, body_mass_g)
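
    For comparison, roughly the same logic in plain data.table syntax (a sketch, not one of the benchmarked answers, and assuming the column names from the example above) would be:
    
    # Sketch: one grouped pass computes the totals and counts, a second pass
    # derives the flag, so no per-group data.tables are ever built
    agg_dt <- function(dt, ratio = 2 / 3) {
      dt[, .(total    = sum(body_mass_g, na.rm = TRUE),
             n_female = sum(sex == "female", na.rm = TRUE),
             n        = .N),
         by = .(group, species, year)
      ][, .(group, species, year, total,
            flag = fifelse(n_female >= ratio * n, "F", "M"))]
    }
    
    agg_dt(penguin_table)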