rdata.tabletiming

Why is fifelse faster than checking a subset in data.table?


I'm comparing the speed difference between these two methods of updating a column on a data table (for testing, the flag column is 50% TRUE, 50% FALSE):

A[flag==TRUE,b:=b + myfunc(a,b)]
A[,b:=b + fifelse(flag,myfunc(a,b),0)]

Naively, I would expect the first case to be faster, since they perform the same number of if checks on the flag column, the same number of computations, but the second method has a number of gratuitous b+0 calculations and b:=b reassignments. However, the results I got using the full code below are:

[1] "Total time run 1: 0.281515121459961"
[1] "Total time run 2: 0.227149963378906"

The second method is almost 24% faster. Why is this the case?

Full test code:

library(data.table)

n = 10000000

# Random non-trivial function
myfunc= function(x,y) {
  return((x+y)*x/(y^2))
}


A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run1Start = Sys.time()
A[flag==TRUE,b:=b + myfunc(a,b)]
run1End = Sys.time()
print(paste0("Total time run 1: ", difftime(run1End,run1Start,units="secs")))


A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run2Start = Sys.time()
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
run2End = Sys.time()
print(paste0("Total time run 2: ", difftime(run2End,run2Start,units="secs")))


Solution

  • When you do:

    A[flag == TRUE, b := b + myfunc(a,b)]
    

    R first has to process which rows satisfy flag == TRUE (i.e. it does a subsetting operation), then perform the function on those rows. Even though you're only running the function on ~1/2 the rows, that extra step of subsetting adds overhead.

    On the other hand, when you use:

    A[, b := b + fifelse(flag, myfunc(a,b), 0)]
    

    the entire operation is done in one go and reduces/removes that overhead.

    However, since flag is a boolean vector, the fastest approach may be:

    A[(flag), b:=b + myfunc(a,b)]
    

    This may seem counterintuitive since it's similar to the first approach, but it takes advantage of the logical vector to subset the rows (not evaluate the rows).

    microbenchmark::microbenchmark(
      x1 = A[flag==TRUE,b:=b + myfunc(a,b)],
      x2 = A[,b:=b + fifelse(flag,myfunc(a,b),0)],
      y1 = A[(flag), b:=b + myfunc(a,b)]
    )
    
    Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval cld
       x1 149.3406 234.1432 333.2438 300.7187 375.2823 1040.9021   100  a 
       x2  94.7926 161.1049 246.2662 218.9331 267.1994  904.3553   100   b
       y1 100.1030 153.8516 209.7753 188.3112 228.9370  866.6606   100   b
    

    An imperfect analogy would be if you were a gardener needing to only water half your plants. The first approach you would walk through and tag the ones needing watering while someone else follows behind and waters where needed. The second approach, you would identify the plants that need watering and water them on the spot. The third approach has all the plants needed for watering already grouped together.