I am doing some computationally demanding operations in R, so I am looking for the most efficient way to do them. My question is: in the code below, is it more efficient to assemble the result with data.frame() or with cbind()?
library(dplyr)
library(igraph)
library(bench)
set.seed(123)
edgelist <- data.frame(
  node1 = sample(1:2000, 11000, replace = T),
  node2 = sample(1:2000, 11000, replace = T),
  weight = runif(11000, min = 0, max = 5)
)
g <- graph_from_data_frame(edgelist, directed = F)
# Data.frame
dat <- function() {
  dm <- distances(g, weight = E(g)$weight)
  UTIndex <- which(upper.tri(dm), arr.ind = T)
  df1 <- data.frame(
    verticeA = as.numeric(rownames(dm)[UTIndex[, 1]]),
    verticeB = as.numeric(colnames(dm)[UTIndex[, 2]]),
    path_length = as.numeric(dm[UTIndex])
  )
}
# Matrix
mat <- function() {
  dm <- distances(g, weight = E(g)$weight)
  UTIndex <- which(upper.tri(dm), arr.ind = T)
  df1 <- cbind(
    verticeA = as.numeric(rownames(dm)[UTIndex[, 1]]),
    verticeB = as.numeric(colnames(dm)[UTIndex[, 2]]),
    path_length = as.numeric(dm[UTIndex])
  )
}
####
results <- bench::mark(
  dat = dat(),
  mat = mat(),
  check = F
)
t1 <- system.time({
  df1 <- dat()
})
rm(df1)
t2 <- system.time({
  df1 <- mat()
})
rm(df1)
Here are the outputs of t1, t2, and results:
> results
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 dat 2.89s 2.89s 0.346 269MB 1.39 1 4 2.89s
2 mat 2.83s 2.83s 0.353 315MB 1.41 1 4 2.83s
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> t1
user system elapsed
2.78 0.04 2.81
> t2
user system elapsed
3.12 0.08 3.21
When benchmarking, you are not sufficiently isolating the element you are actually investigating (the difference between the data.frame() and cbind() calls): the vast majority of the computation in either test is everything other than data.frame() and cbind(), and you are running relatively few iterations, which means random variation can easily be mistaken for a significant difference.
In the following I factor out the common parts, which are irrelevant to the benchmark, and retain only the relevant parts; I then go further and open a discussion of R memory management.
Let's start with the code rewrite:
library(dplyr)
library(igraph)
library(bench)
set.seed(123)
edgelist <- data.frame(
  node1 = sample(1:2000, 11000, replace = T),
  node2 = sample(1:2000, 11000, replace = T),
  weight = runif(11000, min = 0, max = 5)
)
g <- graph_from_data_frame(edgelist, directed = F)
dm <- distances(g, weight = E(g)$weight)
UTIndex <- which(upper.tri(dm), arr.ind = T)
verticeA <- as.numeric(rownames(dm)[UTIndex[, 1]])
verticeB <- as.numeric(colnames(dm)[UTIndex[, 2]])
path_length <- as.numeric(dm[UTIndex])
# Data.frame
datpure <- function(verticeA, verticeB, path_length) {
  data.frame(
    verticeA = verticeA,
    verticeB = verticeB,
    path_length = path_length
  )
}
datpure(verticeA, verticeB, path_length)

# Matrix
matpure <- function(verticeA, verticeB, path_length) {
  cbind(
    verticeA = verticeA,
    verticeB = verticeB,
    path_length = path_length
  )
}
matpure(verticeA, verticeB, path_length)
####
# Overwrite every entry, so that both objects actually have to be written to memory
adjust_to_1s <- function(x) {
  x[,] <- 1
  x
}

(results <- bench::mark(
  dat_pure = datpure(verticeA, verticeB, path_length),
  mat_pure = matpure(verticeA, verticeB, path_length),
  dat_1 = adjust_to_1s(datpure(verticeA, verticeB, path_length)),
  mat_1 = adjust_to_1s(matpure(verticeA, verticeB, path_length)),
  check = F,
  iterations = 100L,
  time_unit = 'ms'
))
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl> <list> <list> <list> <list>
1 dat_pure 0.169 0.178 5350. 0B 54.0 99 1 18.5 <NULL> <Rprofmem [11 × 3]> <bench_tm> <tibble>
2 mat_pure 10.7 12.4 72.9 45.8MB 0.736 99 1 1358. <NULL> <Rprofmem [2 × 3]> <bench_tm> <tibble>
3 dat_1 182. 254. 4.00 292.8MB 0.347 92 8 23023. <NULL> <Rprofmem> <bench_tm> <tibble>
4 mat_1 30.3 37.9 23.8 53.4MB 0.241 99 1 4153. <NULL> <Rprofmem [2 × 3]> <bench_tm> <tibble>
We can see from this that the data.frame() call itself is orders of magnitude faster than cbind(). I think the reason why is memory layout: a data.frame is just a list of column vectors, so data.frame() only needs to associate the existing vectors with names. R is copy-on-write and otherwise works by reference, so the data.frame construction is essentially a trivial change to metadata. cbind(), on the other hand, creates a matrix, which is essentially a single contiguous vector, and therefore must copy the data into that new layout.
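A quick way to see the sharing, as a minimal sketch: this assumes the lobstr package is installed, and that your R version (like mine) does not copy numeric columns when building a data.frame.
library(lobstr)  # assumption: lobstr is installed; obj_addr()/obj_size() inspect memory
x <- runif(1e6)
y <- runif(1e6)
df <- data.frame(x = x, y = y)
m  <- cbind(x = x, y = y)
obj_addr(x); obj_addr(df$x)  # same address: the data.frame column is the original vector
obj_size(x, y)               # size of the two vectors alone (~16 MB)
obj_size(x, y, df)           # barely larger: the data.frame shares the vectors
obj_size(x, y, m)            # roughly double: the matrix holds its own copy of the data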
I then add a variation in which, after the initial pure data.frame() and cbind() calls, we radically alter the objects (setting every entry to 1). Now in both cases R actually has to write to memory and both are slowed down, and here data.frame() performs worse.
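The copy-on-write cost behind that slowdown can be seen directly with tracemem() (base R), which prints a message whenever a vector is duplicated; this is just a small illustrative sketch, not the benchmark above.
x <- runif(1e6)
df <- data.frame(x = x)  # no copy yet: df$x is the same vector as x
tracemem(x)              # report whenever this vector gets duplicated
df$x[1] <- 0             # first write to the shared column: R duplicates it
untracemem(x)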