Tags: python, machine-learning, scikit-learn, random-forest, quantile-regression

RandomForestQuantileRegressor.fit from scikit-garden freezes while training the last tree


I've been working with scikit-garden for around two months, trying to train quantile regression forests (QRF) with a method similar to the one in this paper. The authors of the paper used R, but because my colleagues and I are already familiar with Python, we decided to use the QRF implementation from scikit-garden. The package is in bad shape and doesn't seem to be fully functional: we had to change some of the source code to get it running in the first place. This is more or less my final attempt to get it working.

Now that the code that builds workable datasets is finished, we are trying to train a simple QRF with the default hyperparameters to get a first estimate of the error. So far, not a single training run has finished: it always seems to stall or freeze while training the last tree, and I've always had to kill the job myself to avoid annoying the sysadmin.

For example, I ran my latest training job on 8 CPUs (each CPU trains one tree) with the default settings, which build and train 10 trees. All trees were built and trained within 5-6 minutes except the last one, which I let run for a week before I was forced to kill it. Importantly, only one of the 8 reserved CPUs was active during that time, and it was (apparently) running at 100%.

We have quite large datasets (~2'000'000 observations), but even with smaller excerpts, it still freezes on the last tree. It also makes little sense to me that all of the trees should train quickly on the full dataset except for the last one.
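One way to find out where a stalled fit is stuck, instead of waiting and then killing the job, is to have Python dump the stack of every thread once the call has run longer than expected. The sketch below uses the standard-library `faulthandler` module for that. The helper name `fit_with_watchdog`, its `dump_after` and `log` parameters, and the `slow_stand_in` function are hypothetical and only serve the illustration; they are not part of scikit-garden or scikit-learn.

```python
import faulthandler
import sys
import time


def fit_with_watchdog(fit, *args, dump_after=600, log=sys.stderr, **kwargs):
    """Call fit(*args, **kwargs) and dump all thread stacks if it runs long.

    dump_after (seconds) and log (an open file with a file descriptor) are
    illustrative parameters, not part of any library API.
    """
    # Every `dump_after` seconds, write the Python stack of each thread to `log`.
    faulthandler.dump_traceback_later(dump_after, repeat=True, file=log)
    try:
        return fit(*args, **kwargs)
    finally:
        # Always cancel the watchdog so later code does not trigger more dumps.
        faulthandler.cancel_dump_traceback_later()


def slow_stand_in(seconds):
    """Stand-in for a long-running model.fit call, used only for the demo."""
    time.sleep(seconds)
    return "done"
```

With a model like the one described here, the call would look like `fit_with_watchdog(model.fit, xtrain, ytrain, dump_after=1800)`. If the time is spent inside compiled code, the dump only shows the Python frame that called into it, but that is usually enough to tell whether the job is stuck in tree building or in a later step.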

Here is a small excerpt of the main training code:

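from sklearn.model_selection import train_test_split
from skgarden import RandomForestQuantileRegressor

# features, target and testsize are defined elsewhere in our pipeline
xtrain, xtest, ytrain, ytest = train_test_split(features, target, test_size=testsize)
# default hyperparameters (10 trees); n_jobs=-1 uses all available CPUs
model = RandomForestQuantileRegressor(verbose=2, n_jobs=-1).fit(xtrain, ytrain)

ypred = model.predict(xtest)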

This is my first time posting a question here - if I've forgotten any important information just let me know! Thanks so much to anyone who can help me! :)


Solution

  • A fast, actively maintained QRF implementation that may work for your problem is available here: https://github.com/zillow/quantile-forest

    A simple example of using the package, following your excerpt:

    from quantile_forest import RandomForestQuantileRegressor
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    
    X, y = datasets.fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    qrf = RandomForestQuantileRegressor(n_estimators=10)
    qrf.fit(X_train, y_train)
    
    y_pred = qrf.predict(X_test, quantiles=[0.025, 0.5, 0.975])
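Since the goal in the question is a first estimate of the error, here is a minimal sketch of two common checks for quantile predictions: the pinball (quantile) loss for a single quantile level and the empirical coverage of a prediction interval. The function names `pinball_loss` and `interval_coverage` are my own and not part of quantile-forest; the sketch only needs NumPy.

```python
import numpy as np


def pinball_loss(y_true, y_quantile, q):
    """Mean pinball (quantile) loss of predictions for quantile level q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_quantile, dtype=float)
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def interval_coverage(y_true, lower, upper):
    """Fraction of targets that fall inside the interval [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(np.mean(inside))
```

Assuming `y_pred` from the example above has one column per requested quantile, in the order given, `interval_coverage(y_test, y_pred[:, 0], y_pred[:, 2])` should come out close to 0.95 for a well-calibrated forest, and `pinball_loss(y_test, y_pred[:, 1], 0.5)` scores the median prediction (it equals half the mean absolute error).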