I'm getting an "Error: cannot allocate vector of size ...MB" error when using ff/ffdf and the ffdfdply function.
I'm trying to use the ff and ffdf packages to process a large amount of data that has been keyed into groups. The data (in ffdf table format) looks like this:
x =
id_1 id_2 month year Amount key
1 13 1 2013 -200 11
1 13 2 2013 300 54
2 19 1 2013 300 82
3 33 2 2013 300 70
.... (10+ Million rows)
The unique keys are created using something like:
x$key = as.ff(as.integer(ikey(x[c("id_1","id_2","month","year")])))
To summarise by grouping using the key variable, I have this command:
summary = ffdfdply(x=x, split=x$key, FUN=function(df) {
  df = data.table(df)
  df = df[, list(id_1 = id_1[1], withdraw = sum(Amount*(Amount>0), na.rm=T)), by = "key"]
  df
}, trace=T)
This uses data.table's excellent grouping feature (idea taken from this discussion). In the real code there are more functions to be applied to the Amount variable, but even with this I cannot process the full ffdf table (a smaller subset of the table works fine).
It seems like ffdfdply is using a huge amount of RAM, giving:
Error: cannot allocate vector of size 64MB
Also, BATCHBYTES does not seem to help. Can anyone with experience with ffdfdply recommend another way to go about this, without pre-splitting the ffdf table into chunks?
The most difficult part of using ff/ffbase is making sure your data stays in ff and is not accidentally pulled into RAM. Once your data is in RAM (mostly due to a misunderstanding of when data is put in RAM and when it is not), it is hard to get that RAM back from R, and if you are working at your RAM limit, a small extra request for RAM will get you 'Error: cannot allocate vector of size'.
Now, I think you misspecified the input to ikey. Look at ?ikey: it requires an ffdf as its input argument, not several ff vectors. This has probably put your data in RAM, whereas what you wanted is probably ikey(x[c("id_1","id_2","month","year")]).
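To illustrate the difference between staying in ff and landing in RAM, here is a minimal sketch on a tiny made-up ffdf (the data and the names on_disk and in_ram are only for illustration, not taken from the question). It assumes the usual ff indexing behaviour: single-index column selection keeps an ffdf, while row/column indexing returns an ordinary data.frame.

```r
library(ff)

# Tiny illustrative ffdf (hypothetical data, not the asker's table)
x <- as.ffdf(data.frame(id_1   = 1:6,
                        id_2   = 11:16,
                        Amount = c(-200, 300, 300, 300, 50, -10)))

# Single-index column selection: the result is still an ffdf backed by files
on_disk <- x[c("id_1", "id_2")]
class(on_disk)

# Row/column indexing: the selected data is materialised as a data.frame in RAM
in_ram <- x[, c("id_1", "id_2")]
class(in_ram)

# The same distinction holds for single columns:
# x$Amount is an ff vector, x$Amount[] is a plain numeric vector in RAM
class(x$Amount)
class(x$Amount[])
```

On a table with millions of rows, the second and fourth forms are the ones that can quietly eat your memory.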
I simulated some data on my computer as follows to get an ffdf with 24 million rows, and the following does not give me RAM trouble (it uses approx. 3.5 GB of RAM on my machine):
require(ffbase)
require(data.table)
x <- expand.ffgrid(id_1 = ffseq(1, 1000), id_2 = ffseq(1, 1000), year = as.ff(c(2012,2013)), month = as.ff(1:12))
x$Amount <- ffrandom(nrow(x), rnorm, mean = 10, sd = 5)
x$key <- ikey(x[c("id_1","id_2","month","year")])
x$key <- as.character(x$key)
summary <- ffdfdply(x, split=x$key, FUN=function(df) {
  df <- data.table(df)
  df <- df[, list(
    id_1 = id_1[1],
    id_2 = id_2[1],
    month = month[1],
    year = year[1],
    withdraw = sum(Amount*(Amount>0), na.rm=TRUE)
  ), by = key]
  df
}, trace=TRUE)
Another reason might be that you have too much other data in RAM that you have not mentioned. Note also that in ff all your factor levels are held in RAM. This can also be an issue if you are working with a lot of character/factor data; in that case, ask yourself whether you really need these data in your analysis.
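A rough way to check how much RAM the factor levels of a column take is sketched below. The factor f is a made-up example; on your own data you would look at the factor columns of your ffdf instead.

```r
library(ff)

# Hypothetical ff factor; the integer codes live on disk, the levels live in RAM
f <- as.ff(factor(c("a", "b", "a", "c")))

# levels() returns an ordinary character vector held in memory
lev <- levels(f)
length(lev)

# Approximate RAM used by the levels alone
print(object.size(lev), units = "Kb")
```

With a key stored as character/factor and millions of distinct values, this vector of levels can itself become large.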