r parallel-processing random-forest doparallel parallel-foreach

How can I speed up the training of my random forest?

I'm trying to train several random forests (for regression) to have them compete and see which feature selection and which parameters give the best model.

However the trainings seem to take an insane amount of time, and I'm wondering if I'm doing something wrong.

The dataset I'm using for training (called train below) has 217k rows and 58 columns, of which only 21 serve as predictors in the random forest. The predictors are all numeric or integer, with the exception of a boolean one, which is of class character. The output y is numeric.

I ran the following code four times, setting nb_trees to 4, 100, 500 and 2000:

library("randomForest")
nb_trees <- 100  # set to 4, 100, 500 or 2000 depending on the test (see above)
ptm <- proc.time()
fit <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 
    + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 
    + x20 + x21, 
    data = train, 
    ntree = nb_trees, 
    do.trace=TRUE)
proc.time() - ptm

Here is how long each of them took to train:

nb_trees | time
4        | 4 min
100      | 1 h 41 min
500      | 8 h 40 min
2000     | 34 h 26 min

As my company's server has 12 cores and 125 GB of RAM, I figured I could try to parallelize the training, following this answer. However, I used the doParallel package instead, because the doSNOW version seemed to run forever (I don't know why). I can't find where I read that doParallel would work too, sorry.

library("randomForest")
library("foreach")
library("doParallel")
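nb_trees <- 10  # trees per core; this changes with each test, see table below
nb_cores <- 10  # this changes with each test, see table below
cl <- makeCluster(nb_cores)
registerDoParallel(cl)
ptm <- proc.time()
# %dopar% must sit on the same line as the closing parenthesis of foreach(),
# otherwise R treats foreach(...) as a complete expression and fails to parse the next line
fit <- foreach(ntree = rep(nb_trees, nb_cores), .combine = combine, .packages = "randomForest") %dopar% {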
        randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 
        + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 
        + x20 + x21,
        data = train, 
        ntree = ntree,
        do.trace=TRUE)}
proc.time() - ptm
stopCluster(cl)

When I run it, it takes less time than the non-parallelized code:

nb_trees (per core) | nb_cores | total number of trees                | time
1                   | 4        | 4                                    | 2 min 13 s
10                  | 10       | 100                                  | 52 min
9                   | 12       | 108 (closest to 100 with 12 cores)   | 59 min
42                  | 12       | 504 (closest to 500 with 12 cores)   | I won't be running this one
167                 | 12       | 2004 (closest to 2000 with 12 cores) | I'll run it next weekend

However, I think it's still taking a lot of time, isn't it? I'm aware it takes time to combine the trees into the final forest, so I didn't expect it to be 12 times faster with 12 cores, but it's only about 2 times faster.

Is this normal?
If it isn't, is there anything I can do with my data and/or my code to radically decrease the running time?
If not, should I tell the guy in charge of the server that it should be much faster?

Thanks for your answers.

Notes :

I'm the only one using this server.
For my next tests, I'll get rid of the columns that are not used in the random forest.
I realized quite late that I could improve the running time by calling randomForest(predictors, decision) instead of randomForest(decision ~ ., data = input), and I'll be doing it from now on, but I think my questions above still hold.
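For reference, the matrix-style call mentioned in the last note can be sketched as below. This is only an illustration on synthetic data: the small train data frame, the unused column, and predictor_names are made up here and stand in for the real 58-column dataset.

```r
library(randomForest)

set.seed(7)
# Hypothetical stand-in for `train`: a response, three predictors, one unused column
train <- data.frame(x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500),
                    unused = runif(500))
train$y <- train$x1 - 2 * train$x2 + rnorm(500)

# Names of the columns that actually serve as predictors (assumed for this sketch)
predictor_names <- c("x1", "x2", "x3")

# x/y interface: pass the predictors and the response directly, no formula
fit <- randomForest(x = train[, predictor_names], y = train$y, ntree = 100)
print(fit)
```

Selecting the predictor columns explicitly also takes care of dropping the columns that are not used in the forest.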

Solution

While I'm a fan of brute-force techniques, such as parallelization or running code for an extremely long time, I am an even bigger fan of improving an algorithm so that brute force is not needed.

While training your random forest with 2000 trees was getting prohibitively expensive, training with fewer trees took a more reasonable time. For starters, you can train with, say, 4, 8, 16, 32, ..., 256, 512 trees and carefully observe metrics that tell you how robust the model is. These include a comparison against the best constant model (how well your forest performs on the data set versus a model that predicts a single value, such as the mean or median of y, for all inputs) and the out-of-bag error. In addition, you can observe the top predictors and their importance, and whether they start to converge as you add more trees.
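A minimal sketch of that convergence check, on synthetic data rather than your dataset (x, y, and checkpoints are illustrative names). It relies on the fact that, for regression, a randomForest fit stores the out-of-bag MSE after each tree in fit$mse, so one 512-tree run gives the error at every smaller tree count too. The baseline used here predicts mean(y), which is the best constant under squared error; swap in the median if that is the comparison you prefer.

```r
library(randomForest)

set.seed(42)
# Synthetic stand-in for the real data: 2000 rows, 5 numeric predictors
n <- 2000
x <- data.frame(matrix(rnorm(n * 5), ncol = 5))
names(x) <- paste0("x", 1:5)
y <- 2 * x$x1 - x$x2 + 0.5 * x$x3^2 + rnorm(n)

# One forest with 512 trees; fit$mse[k] is the OOB mean squared error
# obtained with the first k trees only
fit <- randomForest(x = x, y = y, ntree = 512)

checkpoints <- 2^(2:9)  # 4, 8, 16, ..., 512

# Baseline: a constant model that always predicts mean(y)
baseline_mse <- mean((y - mean(y))^2)

convergence <- data.frame(
  ntree       = checkpoints,
  oob_mse     = fit$mse[checkpoints],
  vs_baseline = fit$mse[checkpoints] / baseline_mse  # below 1 means better than the constant model
)
print(round(convergence, 3))
```

If oob_mse has flattened out well before the last checkpoint, the extra trees are buying very little.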

Ideally, you should not have to use thousands of trees to build a model. Once your model begins to converge, adding more trees won't necessarily worsen the model, but it won't add any new information either. By not using more trees than you need, you may be able to cut a calculation that would have taken on the order of a week down to less than a day. If, on top of this, you leverage a dozen CPU cores, then you might be looking at something on the order of hours.
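To put rough numbers on this, the serial timings reported in the question can be turned into a cost per tree. The snippet below uses only those four measurements; the 256-tree figure is a linear extrapolation, which is an assumption, not something that was measured.

```r
# Serial timings from the question, converted to minutes
timings <- data.frame(
  ntree   = c(4, 100, 500, 2000),
  minutes = c(4, 1 * 60 + 41, 8 * 60 + 40, 34 * 60 + 26)
)

# Training time grows almost linearly with the number of trees
timings$min_per_tree <- timings$minutes / timings$ntree
print(timings)

# Rough serial estimate for a 256-tree forest, assuming the same cost per tree
est_256_hours <- 256 * mean(timings$min_per_tree) / 60
print(round(est_256_hours, 1))
```

At roughly one minute per tree, every tree you can drop saves about a minute of serial training time on this dataset.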

To look at variable importance after each random forest run, you can try something along the lines of the following:

fit <- randomForest(...)
round(importance(fit), 2)

It is my understanding that the first, say, 5-10 predictors have the greatest impact on the model. If you notice that, as you increase the number of trees, these top predictors don't really change position relative to each other and the importance metrics stay about the same, then you might want to consider not using so many trees.
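One way to check that stability is to compare the importance rankings of two forests of different sizes. The sketch below does this on synthetic data; importance_for is a hypothetical helper, and the tree counts 50 and 200 are arbitrary. It uses the IncNodePurity column, which is what importance() returns for a regression forest fitted with default settings.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in: 6 predictors, only the first three influence y
n <- 1000
x <- data.frame(matrix(rnorm(n * 6), ncol = 6))
names(x) <- paste0("x", 1:6)
y <- 3 * x$x1 + 2 * x$x2 + x$x3 + rnorm(n)

# Hypothetical helper: named importance vector for a forest of a given size
importance_for <- function(ntree) {
  fit <- randomForest(x = x, y = y, ntree = ntree)
  importance(fit)[, "IncNodePurity"]
}

imp_small <- importance_for(50)
imp_large <- importance_for(200)

# Predictors ordered from most to least important in each forest
print(names(sort(imp_small, decreasing = TRUE)))
print(names(sort(imp_large, decreasing = TRUE)))

# A Spearman rank correlation close to 1 means the ordering barely changed
rank_agreement <- cor(imp_small, imp_large, method = "spearman")
print(rank_agreement)
```

If the top of the two orderings matches and the rank correlation is high, the smaller forest is probably already telling you the same story.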