I have a large dataset with 12 columns and 600,000 rows, and I want to replace outliers with this function
replace_outliers <- function(x, na.rm = TRUE, ...) {
  # First, second (median), and third quartiles
  qnt <- quantile(x, probs = c(0.25, 0.50, 0.75), na.rm = na.rm, ...)
  # Standard 1.5 * IQR upper fence
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Replace values above the upper fence with the median
  # (values below the lower fence are left untouched)
  y[x > (qnt[3] + H)] <- qnt[2]
  y
}
But with a for loop it's going to take too much time. Can I do this much faster without better hardware or a cluster?
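(The loop would be roughly along these lines; df is just a placeholder name for the data, which the question doesn't show:)

# Per-column loop over the 12 columns (sketch; df is a placeholder)
for (i in 1:12) {
  df[[i]] <- replace_outliers(df[[i]])
}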
There are a few ways of optimizing the function, but as your question stands, the operation isn't that slow.
Anyway, without resorting to data.table, dplyr, or parallel programming, we can still get a modest speed increase by simply rewriting your function as
replace_outliers2 = function(x, na.rm = TRUE, ...) {
  qnt = quantile(x, probs = c(0.25, 0.50, 0.75), na.rm = na.rm, ...)
  # Same upper fence as before, since
  # qnt[3] + 1.5 * (qnt[3] - qnt[1]) = 2.5 * qnt[3] - 1.5 * qnt[1].
  # This avoids the call to IQR(), which runs quantile() a second time.
  x[x > (2.5 * qnt[3] - 1.5 * qnt[1])] = qnt[2]
  x
}
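Before timing, a quick sanity check that the rewrite is equivalent (a minimal sketch; z is just example data):

# Both versions use the same upper fence, so the results should match
z = rlnorm(1e5)
all.equal(replace_outliers(z), replace_outliers2(z))  # expected: TRUE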
Some quick timings:
R> x = matrix(rlnorm(600000*12), ncol=12)
R> system.time({for(i in 1:12) replace_outliers(x[,i])})
user system elapsed
1.448 0.008 1.469
R> system.time({ for(i in 1:12) replace_outliers2(x[,i])})
user system elapsed
0.860 0.004 0.869
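If you then want to apply the faster version across all 12 columns without an explicit for loop, one convenient option (a sketch, assuming a numeric matrix as in the timings above) is apply(); it won't be dramatically faster than the loop, but it is more concise:

# Column-wise application; returns a matrix of the same shape
x_clean = apply(x, 2, replace_outliers2)

# If the data lives in a data frame instead:
# df[] = lapply(df, replace_outliers2)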