r performance dataframe categorical-data dummy-variable

R factor function running slow with long dataframe


I have a long dataframe (many millions of rows, several columns). For running fixed effects regressions, I want to declare categorical variables as factors using the factor function, but this is very slow. I am looking for a potential solution to speed it up.

My code is as follows:

library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))

and the following is the very slow line:

my_data$col <- factor(my_data$col)

Solution

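  • If you already know the levels of the factor you are creating, passing them through the levels argument speeds things up considerably, because factor no longer has to find and sort the unique values itself. In the benchmark below the median time drops from about 696 ms to about 292 ms: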

    library(microbenchmark)
    set.seed(237)
    test <- sample(letters, 10^7, replace = TRUE)
    microbenchmark(noLevels = factor(test), withLevels = factor(test, levels = letters), times = 20)
    Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval cld
      noLevels 523.6078 545.3156 653.4833 696.4768 715.9026 862.2155    20   b
    withLevels 248.6904 270.3233 325.0762 291.6915 345.7774 534.2473    20  a 
    

    To get the levels in the OP's situation, simply call unique (the levels then appear in order of first appearance rather than sorted):

    myLevels <- unique(my_data$col)
    my_data$col <- factor(my_data$col, levels = myLevels)
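
    The effect of the level order can be checked on a small vector. The following is a minimal sketch on made-up toy data (the vector x and the object names are illustrative, not part of the OP's data); it assumes that sorted levels, as produced by a plain factor() call, are what is wanted:

    ```r
    # Toy data, purely illustrative
    x <- c("b", "a", "c", "a", "b")

    f_default <- factor(x)                             # levels sorted by factor() itself
    f_unique  <- factor(x, levels = unique(x))         # levels in order of first appearance
    f_sorted  <- factor(x, levels = sort(unique(x)))   # sorted explicitly before the call

    levels(f_unique)                  # level order follows the data
    identical(f_default, f_sorted)    # compares the two sorted variants
    ```

    Sorting only the unique values is cheap compared with sorting the full column, so wrapping unique in sort should cost little when the number of distinct values is small.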
    

    There is also an Rcpp implementation written by Kevin Ushey (Fast factor generation with Rcpp). I modified that code slightly for the case where the levels are known a priori. In the benchmark below, RcppNoLevs is the function from the referenced website (fast_factor) and RcppWithLevs is the modified version (fast_factor_Levs).

    microbenchmark(noLevels = factor(test),
                   withLevels = factor(test, levels = letters),
                   RcppNoLevs = fast_factor(test),
                   RcppWithLevs = fast_factor_Levs(test, letters), times = 20)
    Unit: milliseconds
            expr      min       lq     mean   median       uq       max neval  cld
        noLevels 571.5482 609.6640 672.1249 645.4434 704.4402 1032.7595    20    d
      withLevels 275.0570 294.5768 318.7556 309.2982 342.8374  383.8741    20   c 
      RcppNoLevs 189.5656 203.3362 213.2624 206.9281 215.6863  292.8997    20  b  
    RcppWithLevs 105.7902 111.8863 120.0000 117.9411 122.8043  173.8130    20 a   
    

    Here is the modified Rcpp function that assumes one is passing the levels as an argument:

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // Build a factor from x using levels supplied by the caller.
    template <int RTYPE>
    IntegerVector fast_factor_template_Levs( const Vector<RTYPE>& x, const Vector<RTYPE>& levs) {
        // match() gives the 1-based position of each element of x in levs (NA if absent)
        IntegerVector out = match(x, levs);
        out.attr("levels") = as<CharacterVector>(levs);
        out.attr("class") = "factor";
        return out;
    }
    
    // [[Rcpp::export]]
    SEXP fast_factor_Levs( SEXP x, SEXP levs) {
        // Dispatch on the type of x; levs is assumed to have the same type as x
        switch( TYPEOF(x) ) {
        case INTSXP: return fast_factor_template_Levs<INTSXP>(x, levs);
        case REALSXP: return fast_factor_template_Levs<REALSXP>(x, levs);
        case STRSXP: return fast_factor_template_Levs<STRSXP>(x, levs);
        }
        // Unsupported input type: return NULL
        return R_NilValue;
    }
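
    For readers who want to see what the C++ function does without compiling anything, the same logic can be written in base R. This is only an illustrative sketch: factor_with_levels is a hypothetical helper name, the toy vector is made up, and no claim is made about its speed relative to the versions benchmarked above.

    ```r
    # Base-R sketch of the logic in fast_factor_Levs (hypothetical helper name)
    factor_with_levels <- function(x, levs) {
      # match() returns, for each element of x, its position in levs (NA if absent)
      out <- match(x, levs)
      # Attach the two attributes that turn an integer vector into a factor
      attr(out, "levels") <- as.character(levs)
      class(out) <- "factor"
      out
    }

    # Toy data, purely illustrative; "z" is deliberately not among the levels
    x <- c("b", "a", "c", "a", "z")
    f <- factor_with_levels(x, c("a", "b", "c"))

    as.integer(f)   # integer codes, NA for values outside the supplied levels
    levels(f)
    ```

    As in the C++ version, values that are not among the supplied levels become NA, so the levels passed in should cover every value in the column. The C++ function itself can be compiled from a source file with Rcpp::sourceCpp().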