I have a database with thousands of records. The "for i loop" iteration takes a long time to execute, it is not possible to use it because of the size of the database.
I need to perform an iteration where:
Previously, the database is sorted by variable x
for (i in 2:n) {
if (db$x[i] == db$x[i - 1]) {
db$y[i] <- db$y[i - 1] + db$z[i]
} else {
db$y[i] <- db$z[i]
}
}
Is there a way to make it more efficient?
Any help will be appreciated
First, some repro data:
set.seed(1)
n<-100000
db<-data.frame(x = sample(10, n, TRUE), z = runif(n))
Then, rewrite your code correcting some index to make it work:
db$y<-0
db$y[1]<-db$z[1]
for( i in 2:n) {
if (db$x[i]==db$x[i-1]) {
db$y[i] <- db$y[i-1] + db$z[i]
} else {db$y[i] <- db$z[i]}
if (i%%1000==0) cat(i,"\n")
}
You can try:
require(data.table)
xr<-rleid(db$x)
y<-ave(db$z, xr, FUN = cumsum)
#check whether the results match
all.equal(y, db$y)