I want to row-bind multiple data frames, each with a varying number of rows. I understand that using a for loop in which something like a final_df is overwritten in each iteration is very slow, because R has to copy the entire accumulated data frame on every pass. Usually, this problem can be solved by pre-allocating a data frame with the correct number of rows and columns and then modifying it in place inside the loop. However, in my situation this is trickier, as each individual data frame may have a different number of rows than the previous one. (In my actual code I am dealing with a long list of XML files, from which I extract certain bits of information. Depending on the file, I can end up with more rows or fewer.)
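Schematically, my real code follows the pattern below; the XPath expressions and file locations are simplified placeholders rather than my actual data:

library(xml2)
library(data.table)

# One data frame per file, with a row count that depends on the file contents.
extract_rows <- function(path) {
  doc   <- read_xml(path)
  items <- xml_find_all(doc, ".//record")   # one node per output row
  data.frame(
    id    = xml_text(xml_find_first(items, ".//id")),
    value = xml_text(xml_find_first(items, ".//value"))
  )
}

xml_files <- list.files("data", pattern = "\\.xml$", full.names = TRUE)
df_list   <- lapply(xml_files, extract_rows)  # collect into a list...
final_df  <- rbindlist(df_list)               # ...and bind once at the end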
My attempt so far is to use dplyr::bind_rows() or data.table::rbindlist(), which seem to perform similarly well and both outclass do.call("rbind", ...) by a lot. However, I notice that even with these approaches, the computation time still grows nonlinearly as I increase the number of data frames. Do you have suggestions on how I can further improve the speed of my code? Thanks a lot in advance!
create_df <- function() {
  nrow <- sample(12:15, 1)  # each data frame gets a random number of rows
  ncol <- 10
  randomdf <- matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol) |> data.frame()
  return(randomdf)
}
approach1 <- function(n) {
  final_df <<- matrix(ncol = 10, nrow = 0)  # empty 10-column seed to rbind onto
  for (i in 1:n) {
    current_df <- create_df()
    final_df <<- rbind(final_df, current_df)  # copies all accumulated rows each time
  }
}
approach2 <- function(n) {
  df_list <<- vector("list", n)
  for (i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- do.call("rbind", df_list)
}

approach3 <- function(n) {
  df_list <<- vector("list", n)
  for (i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- dplyr::bind_rows(df_list)
}

approach4 <- function(n) {
  df_list <<- vector("list", n)
  for (i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- data.table::rbindlist(df_list)
}
microbenchmark::microbenchmark(
  approach1(5),
  approach2(5),
  approach3(5),
  approach4(5),
  approach1(50),
  approach2(50),
  approach3(50),
  approach4(50),
  approach1(500),
  approach2(500),
  approach3(500),
  approach4(500),
  times = 10
)
Unit: microseconds
          expr      min       lq      mean    median       uq      max neval
  approach1(5)   1173.5   1201.1   1317.12   1285.30   1402.2   1557.0    10
  approach2(5)    771.6    781.8   1121.18    829.15    944.6   3573.1    10
  approach3(5)    543.7    613.4    966.10    672.15    952.4   3131.8    10
  approach4(5)    520.8    586.5    641.18    621.65    663.8    818.8    10
 approach1(50)  12186.9  12381.4  13932.40  12760.10  14518.8  18537.4    10
 approach2(50)   6497.6   6766.0   7160.26   6967.55   7230.3   8390.6    10
 approach3(50)   3681.3   4143.1   4258.44   4233.10   4347.8   5022.8    10
 approach4(50)   3806.7   3821.8   4166.71   3962.95   4190.6   5900.4    10
approach1(500) 275530.0 285779.1 326732.16 294302.30 304461.0 622130.3    10
approach2(500)  65243.8  67456.7  72789.76  74422.30  77063.0  79485.0    10
approach3(500)  38600.0  39328.4  41372.67  41215.80  42345.2  47488.8    10
approach4(500)  32496.5  36788.1  41160.35  39940.10  46043.2  49752.9    10
approach3 and approach4 are spending most of their time in create_df, so you're not getting a good idea of the speed of the binding operation. Better to time just the binding:
library(dplyr)
library(data.table)
create_df <- function(n) {  # n is the lapply() index; unused, but lets lapply() call this directly
  nrow <- sample(12:15, 1)
  ncol <- 10
  randomdf <- matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol) |> data.frame()
  return(randomdf)
}
df_list <- lapply(c(5, 50, 500), \(n) lapply(1:n, create_df))
approach2 <- function(i) do.call("rbind", df_list[[i]])
approach3 <- function(i) bind_rows(df_list[[i]])
approach4 <- function(i) rbindlist(df_list[[i]])
approach5 <- function(i) rbindlist(df_list[[i]], use.names = FALSE)
microbenchmark::microbenchmark(
  approach2(1),
  approach3(1),
  approach4(1),
  approach5(1),
  approach2(2),
  approach3(2),
  approach4(2),
  approach5(2),
  approach2(3),
  approach3(3),
  approach4(3),
  approach5(3)
)
#> Unit: microseconds
#>         expr     min       lq      mean   median       uq     max neval
#> approach2(1)   321.1   360.40   389.968   377.25   406.65   601.5   100
#> approach3(1)    89.9   118.85   157.806   135.80   191.45   690.2   100
#> approach4(1)    77.2    89.05   176.894   103.05   161.15  4250.6   100
#> approach5(1)    61.8    70.10   100.532    94.15   120.60   223.7   100
#> approach2(2)  3070.4  3228.40  3735.250  3352.30  3574.90  8796.5   100
#> approach3(2)   348.3   408.35   470.308   440.50   514.70   931.6   100
#> approach4(2)   136.7   169.65   204.703   189.25   222.40   362.6   100
#> approach5(2)   111.5   133.85   194.793   150.10   199.50  2957.8   100
#> approach2(3) 31565.1 34130.30 36182.204 35523.60 36503.40 89033.4   100
#> approach3(3)  3008.7  3268.30  3785.467  3440.65  3714.85  7923.1   100
#> approach4(3)   794.4   913.45  1009.823   966.20  1054.20  1692.0   100
#> approach5(3)   655.8   767.35   870.240   822.45   894.95  2124.1   100
Now it is clear that rbindlist is the fastest for larger lists of tables. If your process is taking a long time, though, the binding operation probably isn't the first place I would look.
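If you want to confirm where the time actually goes, profiling the whole pipeline is more informative than micro-benchmarking the bind. A minimal sketch with base R's Rprof, where xml_files and process_file are stand-ins for your actual inputs and per-file extraction function:

# Profile the full pipeline, then list the functions that use the most time.
Rprof("pipeline.out")
df_list  <- lapply(xml_files, process_file)   # per-file extraction (your code)
final_df <- data.table::rbindlist(df_list)    # the binding step
Rprof(NULL)
head(summaryRprof("pipeline.out")$by.self)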
If you know the table columns all line up, you can squeeze a little more performance out of rbindlist by setting its use.names argument to FALSE. Columns are then matched by position rather than by name, so this is only safe when every table has the same column order.
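To see why that matters, here is a small illustration with made-up data frames whose columns are in different orders:

a <- data.frame(x = 1, y = 10)
b <- data.frame(y = 20, x = 2)
data.table::rbindlist(list(a, b), use.names = FALSE)
#> binds by position, so b$y silently lands in column x: rows are (1, 10) and (20, 2)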