Tags: r, performance, optimization, bigdata, outliers

Replace outliers in big data


I have a big data set with 12 columns and 600,000 rows, and I want to replace the outliers in each column with this function:

    replace_outliers <- function(x, na.rm = TRUE, ...) {
      # lower quartile, median, and upper quartile
      qnt <- quantile(x, probs = c(.25, .50, .75), na.rm = na.rm, ...)
      # upper fence: 1.5 times the interquartile range
      H <- 1.5 * IQR(x, na.rm = na.rm)
      y <- x
      # replace everything above the upper fence with the median
      y[x > (qnt[3] + H)] <- qnt[2]
      y
    }

But with a for loop this is going to take too much time. Can I do this much faster without better hardware or a cluster?
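
For reference, the loop I have in mind is roughly this (just a sketch; `dat` stands in for my data):

    # placeholder loop over the 12 columns; `dat` is not defined above
    for (i in 1:12) {
      dat[, i] <- replace_outliers(dat[, i])
    }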


Solution

  • There are a few ways of optimizing the function, but as your question stands, the operation isn't that slow.

    Anyway, without resorting to data.table, dplyr, or parallel programming, we can still get a modest speed increase simply by rewriting your function. The main saving comes from dropping the call to IQR(), which recomputes the quartiles that quantile() has already returned: since IQR(x) = qnt[3] - qnt[1], the upper fence qnt[3] + 1.5 * IQR(x) simplifies to 2.5 * qnt[3] - 1.5 * qnt[1].

    replace_outliers2 = function(x, na.rm = TRUE, ...) {
      qnt = quantile(x, probs = c(.25, .50, .75), na.rm = na.rm, ...)
      # same cutoff as qnt[3] + 1.5 * IQR(x), but without the second
      # quantile computation hidden inside IQR()
      x[x > (2.5 * qnt[3] - 1.5 * qnt[1])] = qnt[2]
      x
    }
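
    As a quick sanity check that the rewrite gives the same answer, you can compare the two versions on a small simulated vector (this example is mine, not from the question):

    set.seed(42)
    z = rlnorm(1e4)
    all.equal(replace_outliers(z), replace_outliers2(z))
    # expect TRUE: the two cutoffs are algebraically identical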
    

    Some quick timings:

    R> x = matrix(rlnorm(600000*12), ncol=12)
    R> system.time({for(i in 1:12) replace_outliers(x[,i])})
       user  system elapsed 
      1.448   0.008   1.469 
    R> system.time({ for(i in 1:12) replace_outliers2(x[,i])})
       user  system elapsed 
      0.860   0.004   0.869
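
    If you want the cleaned data rather than just timings, one base-R way to run the function over every column at once (using the simulated x above) is:

    x_clean = apply(x, 2, replace_outliers2)

    This returns a 600000 x 12 matrix; it is mainly a tidier spelling of the explicit for loop, not an additional speed-up.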