I just realized that sapply(df, sum) is faster than base::colSums(df), and on my machine even faster than base::colSums(M) with matrix input. Why? Did I miss something? Was it always like this?
Results of a benchmark on a 10K×10K data frame:
$ Rscript --vanilla speed_test.R
Unit: milliseconds
expr min lq mean median uq max neval cld
sapply 125.76717 125.77749 126.21036 125.7878 126.4320 127.07610 3 a
colSums 288.09562 293.57566 298.28873 299.0557 303.3853 307.71486 3 b
colSums_M 137.68780 139.08794 141.25548 140.4881 143.0393 145.59055 3 c
colSums2 55.49845 55.86153 56.04262 56.2246 56.3147 56.40479 3 d
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sapply 124.8ms 124.9ms 8.01 440.8KB 0 5 0 625ms <NULL> <Rprofmem [6 × 3]> <bench_tm [5]> <tibble [5 × 3]>
2 colSums 285.7ms 329.5ms 3.04 763.4MB 3.04 2 2 659ms <NULL> <Rprofmem [403 × 3]> <bench_tm [2]> <tibble [2 × 3]>
3 colSums_M 131ms 131ms 7.63 78.2KB 0 4 0 524ms <NULL> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
4 colSums2 55.4ms 55.8ms 17.9 78.2KB 0 9 0 502ms <NULL> <Rprofmem [1 × 3]> <bench_tm [9]> <tibble [9 × 3]>
Note: R version 4.4.2 (2024-10-31) on AlmaLinux 9.5, AMD Ryzen 7 7700X; BLAS is either reference NETLIB or OpenBLAS-OpenMP (it makes no difference).
Same benchmark on an (old) AMD FX(tm)-8350:
$ Rscript --vanilla speed_test.R
## col sums:
Unit: milliseconds
expr min lq mean median uq max neval cld
sapply 169.5607 169.7373 169.8201 169.9138 169.9499 169.9859 3 a
colSums 573.5435 575.7917 576.6825 578.0399 578.2520 578.4640 3 b
colSums_M 130.5275 130.6009 130.6255 130.6744 130.6745 130.6746 3 c
colSums2 148.7892 149.0359 149.4866 149.2826 149.8354 150.3881 3 d
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sapply 170ms 174ms 5.79 440.8KB 0 3 0 518ms <NULL> <Rprofmem [6 × 3]> <bench_tm [3]> <tibble [3 × 3]>
2 colSums 703ms 703ms 1.42 763.4MB 1.42 1 1 703ms <NULL> <Rprofmem [399 × 3]> <bench_tm [1]> <tibble [1 × 3]>
3 colSums_M 131ms 131ms 7.63 78.2KB 0 4 0 524ms <NULL> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
4 colSums2 150ms 150ms 6.65 78.2KB 0 4 0 602ms <NULL> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
Maybe base::colSums just isn't optimized for newer hardware?
Code:
set.seed(42)
m <- 1e4; n <- 1e4
M <- matrix(rnorm(m*n), m, n)
df <- data.frame(M)
options(width=200)
microbenchmark::microbenchmark(
sapply=sapply(df, sum),
colSums=colSums(df),
colSums_M=colSums(M), ## <-- USING MATRIX INPUT TO AVOID as.matrix() OVERHEAD
colSums2=matrixStats::colSums2(M),
times=3L,
check='equivalent'
) |> print()
bench::mark(sapply=sapply(df, sum),
colSums=colSums(df),
colSums_M=colSums(M),
colSums2=matrixStats::colSums2(M), check=FALSE)
The reason is that colSums() coerces the data frame to a matrix, which is an expensive operation. colSums() is intended as a faster alternative to apply(x, 2, sum), which only makes sense for matrix (array) input.
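To convince yourself that the coercion dominates, you can time as.matrix() on its own (a rough sketch using a data frame like the one in the question, sized down so it runs quickly):

```r
## Rough check: the coercion alone accounts for most of colSums(df)'s extra time.
## (df constructed as in the question, but smaller)
df <- data.frame(matrix(rnorm(2e3 * 2e3), 2e3, 2e3))
system.time(as.matrix(df))   # compare with system.time(colSums(df))
```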
We can profile colSums():
Rprof(tmp <- tempfile(), interval = 0.002)
y <- colSums(df)
Rprof(NULL)
summaryRprof(tmp)
unlink(tmp)
On my system, the top entries in "by.total" are:
# total.time total.pct self.time self.pct
#"colSums" 0.048 96 0.018 36
#"as.matrix.data.frame" 0.030 60 0.002 4
#"as.matrix" 0.030 60 0.000 0
#"unlist" 0.022 44 0.022 44
You can see that 60 % of the total time is spent in as.matrix(). The R loop is simply faster than the combination of as.matrix(x) and a C loop.
Also, from an algorithmic perspective, looping over the elements of a list (one vector per column) is faster than looping over one long vector while keeping track of which column each element belongs to. Obviously, a C implementation can be optimized further than an R implementation, as matrixStats shows.
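A practical consequence (my own sketch, not part of the benchmark above): if you need column-wise sums of the same data frame repeatedly, pay the coercion cost once and reuse the matrix:

```r
## Sketch: convert once, then every subsequent call hits the fast C path
df <- data.frame(matrix(rnorm(1e3 * 1e3), 1e3, 1e3))
M  <- as.matrix(df)                        # one-time coercion cost
stopifnot(all.equal(colSums(M), sapply(df, sum)))
```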
PS: You should not use microbenchmark here because the garbage collector can distort the timings. Better to use the bench package, which filters out iterations where the GC was active.
library(bench)
mark(sapply=sapply(df, sum),
colSums=colSums(df),
colSums_M=colSums(M),
colSums2=matrixStats::colSums2(M))
## A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#1 sapply 148.5ms 149.3ms 6.26 440.8KB 0 4 0 639ms <dbl [10,000]> <Rprofmem [6 × 3]> <bench_tm [4]> <tibble [4 × 3]>
#2 colSums 372.3ms 374.2ms 2.67 763.4MB 2.67 2 2 748ms <dbl [10,000]> <Rprofmem [9 × 3]> <bench_tm [2]> <tibble [2 × 3]>
#3 colSums_M 141.5ms 142ms 7.05 78.2KB 0 4 0 567ms <dbl [10,000]> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
#4 colSums2 71.1ms 71.6ms 14.0 78.2KB 0 7 0 501ms <dbl [10,000]> <Rprofmem [1 × 3]> <bench_tm [7]> <tibble [7 × 3]>
#Warning message:
#Some expressions had a GC in every iteration; so filtering is disabled.
Look at the memory allocation and the n_gc column: the garbage collector is active during every colSums(df) call.