This is a follow-up to this question: data.table efficient recycling.
The difference here is that the number of future years is not necessarily the same for each line.
I frequently use recycling in data.table, for example when I need to make projections for future years. I repeat my original data for each future year.
This can lead to something like this:
library(data.table)
dt <- data.table(1:500000, 500000:1, rpois(500000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt)]
But I often have to deal with millions of lines, and far more columns than in this toy example, and the time increases quickly. Try this:
library(data.table)
dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt)]
My question is: is there a more efficient way to achieve this?
Thanks for any help!
Here is a slightly improved version of the other answer, using:

- unlist with recursive=FALSE, use.names=FALSE
- rep.int rather than rep
- seq_len rather than :
- setDT instead of data.table()
- the sequence function suggested by @Cole (and, in f5, data.table's internal vecseq)

Together it seems to make a difference.
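For instance, sequence builds all the seq_len runs in a single call (a tiny illustration, not part of the benchmark):

sequence(c(2L, 3L))
# [1] 1 2 1 2 3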
Timings...
library(data.table)

f0 = function(dt) {
  ## original approach: grouping row by row
  dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt)]
}
f1 = function(dt) {
  ## the other answer: rep() each column, unlist(lapply()) for the year
  dt2 <- data.table(
    rep(dt$V1, dt$V3),
    rep(dt$V2, dt$V3),
    rep(dt$V3, dt$V3),
    unlist(lapply(dt$V3, function(x) 1:x))
  )
  dt2
}
f2 = function(dt) {
  ## rep.int + seq_len, flat unlist, setDT instead of data.table()
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = unlist(lapply(dt$V3, seq_len), recursive = FALSE, use.names = FALSE)
  )
  setDT(dt2)
  dt2
}
f3 = function(dt) {
  ## even better with the sequence function suggested by @Cole
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = sequence(dt$V3)
  )
  setDT(dt2)
  dt2
}
f4 = function(dt) {
  ## same idea as f3, written as a single data.table call
  dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]
}
f5 = function(dt) {
  ## data.table's internal vecseq instead of sequence
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = data.table:::vecseq(rep.int(1L, length(dt$V3)), dt$V3, NULL)
  )
  setDT(dt2)
  dt2
}
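A quick consistency check one might add (not in the original post): f3, f4 and f5 should build the same table, so a small sample can confirm they agree before timing.

dts <- data.table(1:10, 10:1, rpois(10, 5))
all.equal(f3(dts), f4(dts))  # TRUE
all.equal(f3(dts), f5(dts))  # TRUE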
On a "big" data
dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
system.time(f0(dt))
# user system elapsed
# 22.100 18.914 40.449
system.time(f1(dt))
# user system elapsed
# 35.866 15.607 51.475
system.time(f2(dt))
# user system elapsed
# 22.922 6.839 29.760
system.time(f3(dt))
# user system elapsed
# 6.509 6.723 13.233
system.time(f4(dt))
# user system elapsed
# 12.140 14.114 26.254
system.time(f5(dt))
# user system elapsed
# 6.448 4.057 10.506
Anyway, you should also try to improve the processes you run on the expanded dataset, because you may not need to expand it in the first place.
For example, the frollmean function has an adaptive argument, which makes it possible to calculate a rolling mean over a variable-length window; normally one would have to expand the data first to compute that. V3 in your data looks a lot like the window length of an adaptive moving average.
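For instance, an adaptive rolling mean takes a per-row window length, much like your V3 (a minimal sketch; the data and window lengths here are made up for illustration):

library(data.table)
x <- c(1, 3, 5, 7, 9)
n <- c(1L, 2L, 2L, 3L, 1L)   # per-row window length, analogous to V3
frollmean(x, n, adaptive = TRUE)
# [1] 1 2 4 5 9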