r  dplyr  ff  ffbase

Why does summarise in ffbase2 (dplyr_ffbase) fail with "Error in as.vmode.default(): (list) object cannot be coerced to type 'double'"?


I have a large (23 million rows) ffdf table (tbl_ffdf) with 10 columns: 7 of them are factors and 3 contain numbers. It looks something like this:

TABLE_bad

   F1     F2     F3     F4     F5     F6     F7     N1     N2     N3
 1111  01.15  05.14  busns     AA     16      F   55.2  16165      0
 1111  01.15  05.14  busns     AA     16      F   12.5      0   4545
 2222  12.14  11.14  privt     KM      5      T    0.7    255 987777
 2222  12.14  11.14  privt     KM      5      T  111.6   7800      0

I'd like to aggregate the data with sum(Nx) to remove duplicates of this kind and make my table look like this:

TABLE_ok

   F1     F2     F3     F4     F5     F6     F7     N1     N2     N3
 1111  01.15  05.14  busns     AA     16      F   67.7  16165   4545
 2222  12.14  11.14  privt     KM      5      T  112.3   8055 987777
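For reference, here is a minimal sketch of the intended aggregation on a tiny in-memory data.frame, using base R's aggregate() rather than ffbase2. It only illustrates the expected result for the four example rows; it is not meant for a 23-million-row ffdf.

```r
# Toy in-memory copy of the four example rows (plain data.frame, not an ffdf).
TABLE_bad <- data.frame(
  F1 = c("1111", "1111", "2222", "2222"),
  F2 = c("01.15", "01.15", "12.14", "12.14"),
  F3 = c("05.14", "05.14", "11.14", "11.14"),
  F4 = c("busns", "busns", "privt", "privt"),
  F5 = c("AA", "AA", "KM", "KM"),
  F6 = c("16", "16", "5", "5"),
  F7 = c("F", "F", "T", "T"),
  N1 = c(55.2, 12.5, 0.7, 111.6),
  N2 = c(16165, 0, 255, 7800),
  N3 = c(0, 4545, 987777, 0),
  stringsAsFactors = TRUE
)

# Sum the three numeric columns within each combination of the seven factors.
TABLE_ok <- aggregate(
  cbind(N1, N2, N3) ~ F1 + F2 + F3 + F4 + F5 + F6 + F7,
  data = TABLE_bad,
  FUN = sum
)
print(TABLE_ok)
```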

I'm using the ffbase2 package installed from GitHub (it is a dplyr backend for ffdf tables). This is what I run:

TABLE_gr <- group_by(TABLE_bad, F1, F2, F3, F4, F5, F6, F7)    # this step finishes OK
                                                               # in approximately 90 sec

TABLE_ok <- summarise(TABLE_gr, sN1 = sum(N1), sN2 = sum(N2), sN3 = sum(N3))

The summarise call runs for about 10 seconds and then fails with:

Error in as.vmode.default(value, vmode) : 
  (list) object cannot be coerced to type 'double'

After that RStudio drops into debug mode, as configured in my settings. It takes about 3-5 MINUTES before the computer stops hanging and shows the code of the function that raised the error:

function (x, ...) 
UseMethod("as.vmode")

In the Data pane I can see that x is a data.frame of F1 values. The Traceback shows these functions:

eval(expr, envir, enclose)
`[<-`(`*tmp*`, ff::hi(N + 1, N + n), , value = -*etc*-
append_to(out, res, -*etc*-
summarise_.grouped_ffdf( -*etc*-

Reading the ffbase2 source code did not tell me much. As far as I can tell, the method summarise_.grouped_ffdf slices the data recursively and, probably, at the last step it gets a data.frame where it expected a matrix. That is a common cause of the "(list) object cannot be coerced to type 'double'" error.
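The same kind of message can be reproduced in base R without ff or ffbase2. The sketch below only shows the general mechanism (a data.frame is a list of columns and cannot be coerced to a double vector); it does not claim to be the exact call path inside ffbase2. The variable res is a made-up stand-in, and the exact quoting of the message differs slightly between R versions.

```r
# A data.frame is a list of columns, so coercing it to a numeric vector
# fails as soon as a column holds more than one value.
res <- data.frame(F1 = c(1111, 2222), sN1 = c(67.7, 112.3))

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    as.double(res)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```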

I have no idea what the real cause of this error is or how to fix it. Help please! :-)


Solution

  • Today I found the cause of the error. The relevant part of the source code of summarise_.grouped_ffdf looks like this:

    42   for (i in grouped_chunks(.data)){
    43     ch <- grouped_df(data_s[i,,drop=FALSE], groups(.data))
    44     res <- summarise_(ch, .dots = dots)
    45     out <- append_to(out, res, check_structure=FALSE)
    46   }
    

    This loop cuts the data into chunks according to the groups (line 43) and applies the usual dplyr summarise to each chunk (line 44). Then it appends the result to the output variable (line 45). But the source of append_to shows that, for the append to work, res must be a tbl_ffdf object, whereas here it is a plain data.frame. So modifying line 45 of the file manip-grouped-ffdf.r as follows completely solves the problem:

    45     out <- append_to(out, tbl_ffdf(res), check_structure=FALSE) 
    

    That's very nice, but after that I ran into out-of-memory problems when using this summarise. Investigation pointed to grouped_chunks(.data) as the cause. I didn't dig into why that happens or how to fix it; I just sliced my data month by month in a for loop and appended the aggregated chunks to each other afterwards.
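The month-by-month workaround can be sketched roughly as follows. This is a base R illustration on a toy data.frame, not the code actually used on the ffdf: the column names month and id are hypothetical, and it assumes the slicing column is itself one of the grouping keys, so that every group falls into exactly one slice.

```r
# Toy data: `month` is a hypothetical slicing column that is also a grouping key.
dat <- data.frame(
  month = c("01.15", "01.15", "12.14", "12.14", "12.14"),
  id    = c("1111", "1111", "2222", "2222", "3333"),
  N1    = c(55.2, 12.5, 0.7, 111.6, 1),
  stringsAsFactors = TRUE
)

pieces <- list()
for (m in levels(dat$month)) {
  # Only one month's rows are handled per iteration.
  chunk <- dat[dat$month == m, , drop = FALSE]
  pieces[[m]] <- aggregate(N1 ~ month + id, data = chunk, FUN = sum)
}

# Each group lives in exactly one slice, so the partial results are just stacked.
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

If the slicing column were not part of the grouping keys, the stacked partial sums would need one more aggregation pass over result.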