machine-learning, scikit-learn, random-forest

Subsample size in scikit-learn RandomForestClassifier


How is it possible to control the size of the subsample used for the training of each tree in the forest? According to the documentation of scikit-learn:

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

So bootstrap allows randomness, but I can't find how to control the size of the subsample.


Solution

  • Scikit-learn doesn't provide this option directly, but you can easily get it with a (slower) combination of a decision tree and the bagging meta-estimator:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    
    # each tree is trained on a random 50% subsample of the training set
    # (note: in scikit-learn >= 1.2 the argument is named `estimator`)
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.5)
    

    As a side note, Breiman's random forest indeed doesn't treat the subsample size as a parameter, relying entirely on the bootstrap, so approximately a fraction 1 - 1/e ≈ 63.2% of the distinct samples are used to build each tree.
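A runnable sketch of the approach above, fitting the bagging ensemble on a toy dataset and checking the subsample size through the `estimators_samples_` attribute. The dataset, seed, and `n_estimators` value are illustrative choices, and the snippet assumes scikit-learn >= 1.2, where `base_estimator` was renamed to `estimator`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# toy dataset with 100 samples (illustrative, not from the original answer)
X, y = make_classification(n_samples=100, random_state=0)

clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=10,
    max_samples=0.5,   # each tree sees a bootstrap draw of 50% of the data
    random_state=0,
)
clf.fit(X, y)

# estimators_samples_ lists, per tree, the indices drawn for training
sizes = [len(s) for s in clf.estimators_samples_]
print(sizes)  # every tree was trained on 50 of the 100 samples
```

Since `bootstrap=True` by default in `BaggingClassifier`, those 50 indices are drawn with replacement, mirroring the random-forest behaviour but with a controllable subsample size.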
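A quick numeric sanity check of that side note (not part of the original answer): a bootstrap sample of size n drawn with replacement contains on average about 1 - 1/e ≈ 63.2% unique points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# draw n indices with replacement, i.e. one bootstrap sample
bootstrap = rng.integers(0, n, size=n)
unique_fraction = np.unique(bootstrap).size / n

print(round(unique_fraction, 3))  # close to 1 - 1/e ≈ 0.632
```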