rperformanceloopsrowliterate-programming

Faster alternative methods to for-loop


I have a database with thousands of records. The "for i loop" iteration takes a long time to execute, it is not possible to use it because of the size of the database.

I need to perform an iteration where:

  1. if row i of x is equal to row i-1 of x then,
  2. y(i) = y(i-1) + z(i), y is a cumulative variable
  3. Else y(i) = z(i)

Previously, the database is sorted by variable x

for (i in 2:n) {
  if (db$x[i] == db$x[i - 1]) {
    db$y[i] <- db$y[i - 1] + db$z[i]  
  } else {
    db$y[i] <- db$z[i]
  }
}

Is there a way to make it more efficient?

Any help will be appreciated


Solution

  • First, some repro data:

    set.seed(1)
    n<-100000
    db<-data.frame(x = sample(10, n, TRUE), z = runif(n))
    

    Then, rewrite your code correcting some index to make it work:

    db$y<-0
    db$y[1]<-db$z[1]
    for( i in 2:n) {
      if (db$x[i]==db$x[i-1]) {
          db$y[i] <- db$y[i-1] + db$z[i]  
          } else {db$y[i] <- db$z[i]}
      if (i%%1000==0) cat(i,"\n")
      
    }
    

    You can try:

    require(data.table)
    xr<-rleid(db$x)
    y<-ave(db$z, xr, FUN = cumsum)
    #check whether the results match
    all.equal(y, db$y)