I have a large (23 million rows) ffdf table (tbl_ffdf) with 10 columns: 7 are factors and 3 contain numbers. It looks something like this:
TABLE_bad
F1 F2 F3 F4 F5 F6 F7 N1 N2 N3
1111 01.15 05.14 busns AA 16 F 55.2 16165 0
1111 01.15 05.14 busns AA 16 F 12.5 0 4545
2222 12.14 11.14 privt KM 5 T 0.7 255 987777
2222 12.14 11.14 privt KM 5 T 111.6 7800 0
I'd like to aggregate the data with sum(Nx) to remove these duplicates and make my table look like this:
TABLE_ok
F1 F2 F3 F4 F5 F6 F7 N1 N2 N3
1111 01.15 05.14 busns AA 16 F 67.7 16165 4545
2222 12.14 11.14 privt KM 5 T 112.3 8055 987777
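A minimal sketch of the aggregation I'm after, assuming a plain in-memory data.frame that holds only the four sample rows (the names sample_rows and expected are hypothetical). It uses base R's aggregate purely to illustrate the intended grouping and sums; it is not ffbase2 code and would not scale to the real ffdf table:

```r
# Hypothetical in-memory copy of the four sample rows from TABLE_bad.
sample_rows <- data.frame(
  F1 = c("1111", "1111", "2222", "2222"),
  F2 = c("01.15", "01.15", "12.14", "12.14"),
  F3 = c("05.14", "05.14", "11.14", "11.14"),
  F4 = c("busns", "busns", "privt", "privt"),
  F5 = c("AA", "AA", "KM", "KM"),
  F6 = c("16", "16", "5", "5"),
  F7 = c("F", "F", "T", "T"),
  N1 = c(55.2, 12.5, 0.7, 111.6),
  N2 = c(16165, 0, 255, 7800),
  N3 = c(0, 4545, 987777, 0)
)

# Sum N1..N3 within every distinct combination of F1..F7.
expected <- aggregate(cbind(N1, N2, N3) ~ F1 + F2 + F3 + F4 + F5 + F6 + F7,
                      data = sample_rows, FUN = sum)
print(expected)
```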
I'm using the ffbase2 package installed from GitHub (which is dplyr for ffdf tables). I'm doing the following:
TABLE_gr <- group_by(TABLE_bad, F1, F2, F3, F4, F5, F6, F7) # this step finishes OK
# in approximately 90 sec
TABLE_ok <- summarise(TABLE_gr, sN1 = sum(N1), sN2 = sum(N2), sN3 = sum(N3))
After that it runs for about 10 seconds and fails with:
Error in as.vmode.default(value, vmode) :
(list) object cannot be coerced to type 'double'
After that RStudio drops into debug mode (per my settings), and it takes about 3-5 MINUTES, with the computer hanging, before it gets deep enough to show the code of the function that raised the error:
function (x, ...)
UseMethod("as.vmode")
In the Data pane, x is a data.frame of F1 values. The traceback shows these functions:
eval(expr, envir, enclose)
`[<-`(`*tmp*`, ff::hi(N + 1, N + n), , value = -*etc*-
append_to(out, res, -*etc*-
summarise_.grouped_ffdf( -*etc*-
Looking into the ffbase2 source code didn't tell me much. As far as I can tell, the method summarise_.grouped_ffdf slices the data recursively and, probably, at the last step it gets a data.frame where it expected a matrix? That is a common cause of the "(list) object cannot be coerced to type 'double'" error.
I have no idea what the real cause of this error is or how to fix it. Help please! :-)
Today I found the cause of the error. The relevant part of the source code of summarise_.grouped_ffdf looks like this:
42 for (i in grouped_chunks(.data)){
43 ch <- grouped_df(data_s[i,,drop=FALSE], groups(.data))
44 res <- summarise_(ch, .dots = dots)
45 out <- append_to(out, res, check_structure=FALSE)
46 }
This function cuts the data into pieces according to groups (line 43) and applies the usual dplyr summarise to each piece (line 44). Then it appends the result to the output variable. But the source of append_to shows that, for the append to work correctly, res must be a tbl_ffdf object, whereas here it is a plain data.frame. So modifying line 45 of the file manip-grouped-ffdf.r in the following way completely solves the problem:
45 out <- append_to(out, tbl_ffdf(res), check_structure=FALSE)
That's very nice, but after that I ran into out-of-memory problems when using this summarise. Investigation showed that the cause is grouped_chunks(.data). I didn't dig into why that is or what to do about it; I just sliced my data month by month in a for loop and appended the aggregated chunks to each other afterwards.
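A rough sketch of that month-by-month workaround, not my exact code. It assumes a small in-memory data.frame stands in for the ffdf table, that F2 is the month column (an assumption made for illustration), and that base R's aggregate replaces the ffbase2 summarise. Slicing by month is only safe because the month column is itself one of the grouping columns, so no group is ever split across two slices:

```r
# Toy in-memory stand-in for the real ffdf table (rows from the example in the question).
TABLE_bad <- data.frame(
  F1 = c("1111", "1111", "2222", "2222"),
  F2 = c("01.15", "01.15", "12.14", "12.14"),
  F3 = c("05.14", "05.14", "11.14", "11.14"),
  F4 = c("busns", "busns", "privt", "privt"),
  F5 = c("AA", "AA", "KM", "KM"),
  F6 = c("16", "16", "5", "5"),
  F7 = c("F", "F", "T", "T"),
  N1 = c(55.2, 12.5, 0.7, 111.6),
  N2 = c(16165, 0, 255, 7800),
  N3 = c(0, 4545, 987777, 0)
)

# Assumption: F2 holds the month. Since F2 is also a grouping column,
# each group lies entirely inside one month and needs no re-aggregation later.
months <- unique(as.character(TABLE_bad$F2))
pieces <- vector("list", length(months))

for (k in seq_along(months)) {
  # Only one month's rows are aggregated at a time.
  month_rows <- TABLE_bad[TABLE_bad$F2 == months[k], , drop = FALSE]
  pieces[[k]] <- aggregate(cbind(N1, N2, N3) ~ F1 + F2 + F3 + F4 + F5 + F6 + F7,
                           data = month_rows, FUN = sum)
}

# Append the per-month aggregates to each other.
TABLE_ok <- do.call(rbind, pieces)
print(TABLE_ok)
```

If the slicing column were not part of the grouping key, the appended result would still contain duplicate groups and would need one more aggregation pass.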