I just realized that sapply(df, sum) is faster than base::colSums(df), and on my machine even faster than base::colSums(M) with matrix input. Why? Did I miss something? Was it always like this?
Results of a benchmark on a 10K×10K data frame:
$ Rscript --vanilla speed_test.R
Unit: milliseconds
expr min lq mean median uq max neval cld
sapply 125.76717 125.77749 126.21036 125.7878 126.4320 127.07610 3 a
colSums 288.09562 293.57566 298.28873 299.0557 303.3853 307.71486 3 b
colSums_M 137.68780 139.08794 141.25548 140.4881 143.0393 145.59055 3 c
colSums2 55.49845 55.86153 56.04262 56.2246 56.3147 56.40479 3 d
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sapply 124.8ms 124.9ms 8.01 440.8KB 0 5 0 625ms <NULL> <Rprofmem [6 × 3]> <bench_tm [5]> <tibble [5 × 3]>
2 colSums 285.7ms 329.5ms 3.04 763.4MB 3.04 2 2 659ms <NULL> <Rprofmem [403 × 3]> <bench_tm [2]> <tibble [2 × 3]>
3 colSums_M 131ms 131ms 7.63 78.2KB 0 4 0 524ms <NULL> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
4 colSums2 55.4ms 55.8ms 17.9 78.2KB 0 9 0 502ms <NULL> <Rprofmem [1 × 3]> <bench_tm [9]> <tibble [9 × 3]>
Note: R version 4.4.2 (2024-10-31) on AlmaLinux 9.5, AMD Ryzen 7 7700X; BLAS is either reference NETLIB or OpenBLAS-OpenMP (it makes no difference).
Same benchmark on an (old) AMD FX(tm)-8350:
$ Rscript --vanilla speed_test.R
## col sums:
Unit: milliseconds
expr min lq mean median uq max neval cld
sapply 169.5607 169.7373 169.8201 169.9138 169.9499 169.9859 3 a
colSums 573.5435 575.7917 576.6825 578.0399 578.2520 578.4640 3 b
colSums_M 130.5275 130.6009 130.6255 130.6744 130.6745 130.6746 3 c
colSums2 148.7892 149.0359 149.4866 149.2826 149.8354 150.3881 3 d
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sapply 170ms 174ms 5.79 440.8KB 0 3 0 518ms <NULL> <Rprofmem [6 × 3]> <bench_tm [3]> <tibble [3 × 3]>
2 colSums 703ms 703ms 1.42 763.4MB 1.42 1 1 703ms <NULL> <Rprofmem [399 × 3]> <bench_tm [1]> <tibble [1 × 3]>
3 colSums_M 131ms 131ms 7.63 78.2KB 0 4 0 524ms <NULL> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
4 colSums2 150ms 150ms 6.65 78.2KB 0 4 0 602ms <NULL> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
Maybe base::colSums just isn't optimized for newer hardware?
Code:
set.seed(42)
m <- 1e4; n <- 1e4
M <- matrix(rnorm(m*n), m, n)
df <- data.frame(M)
options(width=200)
microbenchmark::microbenchmark(
sapply=sapply(df, sum),
colSums=colSums(df),
colSums_M=colSums(M), ## <-- USING MATRIX INPUT TO AVOID as.matrix() OVERHEAD
colSums2=matrixStats::colSums2(M),
times=3L,
check='equivalent'
) |> print()
bench::mark(sapply=sapply(df, sum),
colSums=colSums(df),
colSums_M=colSums(M),
colSums2=matrixStats::colSums2(M), check=FALSE)
The reason is that colSums() coerces the data frame to a matrix, which is an expensive operation. colSums() is intended as a faster alternative to apply(x, 2, sum), which only makes sense for matrix (array) input.
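To convince yourself that the coercion dominates, you can time as.matrix() on its own (a rough sketch using a data frame like the one in the question, sized down so it runs quickly):

```r
## Rough check: the coercion alone accounts for most of colSums(df)'s extra time.
## (df constructed as in the question, but smaller)
df <- data.frame(matrix(rnorm(2e3 * 2e3), 2e3, 2e3))
system.time(as.matrix(df))   # compare with system.time(colSums(df))
```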
We can profile colSums():
Rprof(tmp <- tempfile(), interval = 0.002)
y <- colSums(df)
Rprof(NULL)
summaryRprof(tmp)
unlink(tmp)
On my system, the top entries in "by.total" are:
# total.time total.pct self.time self.pct
#"colSums" 0.048 96 0.018 36
#"as.matrix.data.frame" 0.030 60 0.002 4
#"as.matrix" 0.030 60 0.000 0
#"unlist" 0.022 44 0.022 44
You can see that 60 % of the total time is spent in as.matrix(). The R loop is simply faster than the combination of as.matrix(x) and a C loop.
Also, from an algorithmic perspective, looping over the elements of a list (one vector per column) is faster than looping over one long vector while keeping track of which column each element belongs to. Obviously, a C implementation can be optimized further than an R implementation, as matrixStats shows.
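A practical consequence (my own sketch, not part of the benchmark above): if you need column-wise sums of the same data frame repeatedly, pay the coercion cost once and reuse the matrix:

```r
## Sketch: convert once, then every subsequent call hits the fast C path
df <- data.frame(matrix(rnorm(1e3 * 1e3), 1e3, 1e3))
M  <- as.matrix(df)                        # one-time coercion cost
stopifnot(all.equal(colSums(M), sapply(df, sum)))
```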
PS: You should not use microbenchmark here because the garbage collector can distort the timings. Better to use the bench package, which filters out iterations where the GC was active.
library(bench)
mark(sapply=sapply(df, sum),
colSums=colSums(df),
colSums_M=colSums(M),
colSums2=matrixStats::colSums2(M))
## A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#1 sapply 148.5ms 149.3ms 6.26 440.8KB 0 4 0 639ms <dbl [10,000]> <Rprofmem [6 × 3]> <bench_tm [4]> <tibble [4 × 3]>
#2 colSums 372.3ms 374.2ms 2.67 763.4MB 2.67 2 2 748ms <dbl [10,000]> <Rprofmem [9 × 3]> <bench_tm [2]> <tibble [2 × 3]>
#3 colSums_M 141.5ms 142ms 7.05 78.2KB 0 4 0 567ms <dbl [10,000]> <Rprofmem [1 × 3]> <bench_tm [4]> <tibble [4 × 3]>
#4 colSums2 71.1ms 71.6ms 14.0 78.2KB 0 7 0 501ms <dbl [10,000]> <Rprofmem [1 × 3]> <bench_tm [7]> <tibble [7 × 3]>
#Warning message:
#Some expressions had a GC in every iteration; so filtering is disabled.
Look at the memory allocation and the n_gc column: the garbage collector is active during every colSums(df) call.