Tags: r, dataframe, performance, microbenchmark, edge-list

How can data.frames be faster than matrices?


I am doing some computationally demanding operations in R, so I am looking for the most efficient way to do them. My question is:

  1. Why is creating a data.frame seemingly faster than creating a matrix? My understanding is that the general consensus is that matrices are faster than data.frames when all the data is of the same type; here it is not.
library(dplyr)
library(igraph)
library(bench)

set.seed(123)

edgelist <- data.frame(
  node1 = sample(1:2000, 11000, replace = T),
  node2 = sample(1:2000, 11000, replace = T),
  weight = runif(11000, min = 0, max = 5)
)

g <- graph_from_data_frame(edgelist, directed = F)

#Data.frame
dat <- function() {

  dm <- distances(g, weight = E(g)$weight)
  
  UTIndex <- which(upper.tri(dm), arr.ind = T)

  df1 <- data.frame(
    verticeA = as.numeric(rownames(dm)[UTIndex[, 1]]),
    verticeB = as.numeric(colnames(dm)[UTIndex[, 2]]),
    path_length = as.numeric(dm[UTIndex])
  )
}

#Matrix
mat <- function() {
  
  dm <- distances(g, weight = E(g)$weight)
  
  UTIndex <- which(upper.tri(dm), arr.ind = T)
  
  df1 <- cbind(
    verticeA = as.numeric(rownames(dm)[UTIndex[, 1]]),
    verticeB = as.numeric(colnames(dm)[UTIndex[, 2]]),
    path_length = as.numeric(dm[UTIndex])
  )
}
####

results <- bench::mark(
  dat = dat(),
  mat = mat(),
  check = F
)

t1 <- system.time({
  df1 <- dat()
})

rm(df1)

t2 <- system.time({
  df1 <- mat()
})

rm(df1)

Here is the output of t1, t2, and results:

> results
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 dat           2.89s    2.89s     0.346     269MB     1.39     1     4      2.89s
2 mat           2.83s    2.83s     0.353     315MB     1.41     1     4      2.83s
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> t1
   user  system elapsed 
   2.78    0.04    2.81 
> t2
   user  system elapsed 
   3.12    0.08    3.21 

Solution

  • When benchmarking, you are not sufficiently isolating the elements you are investigating (the difference between the data.frame() and cbind() calls); the vast majority of the computation in either test is everything other than data.frame() and cbind(), and you are running relatively few comparisons, which means you can mistake random variation for a significant difference.

    In the following I isolate the common parts, which are irrelevant to the benchmark, and retain only the relevant parts; I then go further and open a discussion of R memory management.

    Let's start with a code rewrite:

    library(dplyr)
    library(igraph)
    library(bench)
    
    set.seed(123)
    
    edgelist <- data.frame(
      node1 = sample(1:2000, 11000, replace = T),
      node2 = sample(1:2000, 11000, replace = T),
      weight = runif(11000, min = 0, max = 5)
    )
    
    g <- graph_from_data_frame(edgelist, directed = F)
    
    dm <- distances(g, weight = E(g)$weight)
    
    UTIndex <- which(upper.tri(dm), arr.ind = T)
    verticeA <- as.numeric(rownames(dm)[UTIndex[, 1]])
    verticeB  <- as.numeric(colnames(dm)[UTIndex[, 2]])     
    path_length <- as.numeric(dm[UTIndex])
    
    #Data.frame
    datpure <- function(verticeA, verticeB, path_length) {
      data.frame(
        verticeA = verticeA,
        verticeB = verticeB,
        path_length = path_length
      )
    }
    datpure(verticeA, verticeB, path_length)
    
    #Matrix
    matpure <- function(verticeA, verticeB, path_length) {
      cbind(
        verticeA = verticeA,
        verticeB = verticeB,
        path_length = path_length
      )
    }
    matpure(verticeA, verticeB, path_length)
    ####
    
    adjust_to_1s <- function(x){
      x[,] <- 1
      x
    }
    
    (results <- bench::mark(
      dat_pure = datpure(verticeA, verticeB, path_length),
      mat_pure = matpure(verticeA, verticeB, path_length),
      dat_1 = adjust_to_1s(datpure(verticeA, verticeB, path_length)),
      mat_1 = adjust_to_1s(matpure(verticeA, verticeB, path_length)),
      check = F,
      iterations = 100L,
      time_unit = 'ms'
    ))
    
    # A tibble: 4 × 13
      expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time       gc      
      <bch:expr>   <dbl>   <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list> <list>              <list>     <list>  
    1 dat_pure     0.169   0.178   5350.          0B   54.0      99     1       18.5 <NULL> <Rprofmem [11 × 3]> <bench_tm> <tibble>
    2 mat_pure    10.7    12.4       72.9     45.8MB    0.736    99     1     1358.  <NULL> <Rprofmem [2 × 3]>  <bench_tm> <tibble>
    3 dat_1      182.    254.         4.00   292.8MB    0.347    92     8    23023.  <NULL> <Rprofmem>          <bench_tm> <tibble>
    4 mat_1       30.3    37.9       23.8     53.4MB    0.241    99     1     4153.  <NULL> <Rprofmem [2 × 3]>  <bench_tm> <tibble>
    

    We can see from this that the data.frame() call itself is well over an order of magnitude faster than cbind(). I think the answer as to why lies in memory layout: for data.frame(), R simply needs a list (which is what a data.frame is underneath) and to associate the existing vectors with names. R is copy-on-write and otherwise works by reference, so constructing the data.frame is essentially a trivial change of metadata. cbind(), on the other hand, creates a matrix, which is essentially a single contiguous vector, and therefore must copy the data and lay it out in memory.
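
    If you want to verify the sharing claim directly, here is a small sketch (it assumes the lobstr package, which is not used elsewhere in this answer, and R's usual copy-on-modify behaviour; the printed addresses will differ between machines):

    library(lobstr)
    
    x <- runif(1e6)                  # ~8 MB of doubles
    
    d <- data.frame(a = x)
    m <- cbind(a = x)
    
    obj_addr(x)                      # address of the original vector
    obj_addr(d$a)                    # typically the same address: the column is the same vector as x
    obj_size(x, d)                   # barely more than obj_size(x), because the memory is shared
    obj_size(x, m)                   # roughly double, because cbind() copied the data into the matrix

    If the column really is shared, the two obj_addr() calls print the same address, and obj_size(x, m) comes out at roughly twice obj_size(x, d).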

    I also add a variation where, after the initial pure data.frame() and cbind() calls, we radically alter the objects (setting every entry to 1). Now R has to write to memory in both cases and both are slowed, but the data.frame performs worse, presumably because the data.frame replacement method works column by column through R-level code (allocating intermediate copies along the way), whereas assigning into a matrix is a single internal operation.
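
    The moment that copy happens can also be observed with base R's tracemem() (a small sketch; it requires an R build with memory profiling enabled, which the CRAN binaries have, and the exact number of copy messages can vary by R version):

    x <- runif(1e6)
    d <- data.frame(a = x)
    
    tracemem(x)                      # trace duplications of the vector shared with the data.frame column
    d$a[1] <- 1                      # the first write prints a tracemem copy message:
                                     # the shared column is duplicated before it is modified
    untracemem(x)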