python, machine-learning, scikit-learn, cross-validation, overfitting-underfitting

How to test for overfitting in regression cross-validation with GridSearchCV?


I am running a regression model on a set of continuous variables with a continuous target. This is my code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline

def run_RandomForest(xTrain, yTrain, xTest, yTest):
  cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

  # define the pipeline to evaluate: feature selection followed by the forest
  model = RandomForestRegressor()
  fs = SelectKBest(score_func=mutual_info_regression)
  pipeline = Pipeline(steps=[('sel', fs), ('rf', model)])

  # define the grid, including the number of features to keep
  grid = {
      'sel__k': [i for i in range(1, xTrain.shape[1] + 1)],
      'rf__bootstrap': [True, False],
      'rf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
      'rf__max_features': ['auto', 'sqrt'],
      'rf__min_samples_leaf': [1, 2, 4],
      'rf__min_samples_split': [2, 5, 10],
      'rf__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
  }
  search = GridSearchCV(
        pipeline,
        param_grid=grid,
        scoring='neg_mean_squared_error',
        return_train_score=True,
        verbose=1,
        cv=cv,
        n_jobs=-1)

  # perform the fitting
  results = search.fit(xTrain, yTrain)

  # predict the target values of xTest
  y_pred = results.predict(xTest)

run_RandomForest(x_train, y_train, x_test, y_test)

I want to understand if this model is over-fitting. I read that incorporating cross-validation is an effective way to check this.

You can see I've incorporated cv into the code above. However, I'm stuck on the next step. Can someone show me code that takes the CV information and produces either a plot or a set of statistics that I should analyse for overfitting? I know there are similar questions on SO (e.g. here and here), but I don't understand how to translate either of them to my situation, because in both of those examples they simply initialise a model and fit it, whereas mine uses GridSearchCV.


Solution

  • You can certainly tune the hyperparameters that control the number of features randomly chosen to grow each tree from the bootstrapped data. Typically you do this via k-fold cross-validation: choose the tuning parameter that minimizes test-sample prediction error. In addition, growing a larger forest will improve predictive accuracy, although there are usually diminishing returns once you get up to several hundred trees.

    Try this sample code.

    from pprint import pprint
    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(random_state=42)

    # Look at the parameters used by our current forest
    print(rf.get_params())
    

    Result:

    {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
    

    Also...

    import numpy as np
    from sklearn.model_selection import RandomizedSearchCV

    # Number of trees in the random forest
    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    # Number of features to consider at every split
    max_features = ['auto', 'sqrt']
    # Maximum number of levels in each tree
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    # Minimum number of samples required to split a node
    min_samples_split = [2, 5, 10]
    # Minimum number of samples required at each leaf node
    min_samples_leaf = [1, 2, 4]
    # Method of selecting samples for training each tree
    bootstrap = [True, False]

    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
    pprint(random_grid)
    

    Result:

    {'bootstrap': [True, False],
     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
     'max_features': ['auto', 'sqrt'],
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10],
     'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
    

    See this link for more info.

    https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6
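
    Note that the block above only builds random_grid; the RandomizedSearchCV that gets imported is never actually fitted. Below is a minimal sketch of how you might plug random_grid into a randomized search for your regression problem. It assumes xTrain and yTrain hold your training data (as in your function), and the n_iter, cv and scoring values are illustrative choices, not prescriptions.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    # Randomized search samples n_iter parameter settings from random_grid
    # instead of exhaustively trying every combination like GridSearchCV.
    rf = RandomForestRegressor(random_state=42)
    rf_random = RandomizedSearchCV(
        estimator=rf,
        param_distributions=random_grid,
        n_iter=50,                          # number of sampled settings (illustrative)
        scoring='neg_mean_squared_error',
        cv=5,
        random_state=42,
        n_jobs=-1)

    rf_random.fit(xTrain, yTrain)           # xTrain / yTrain: your training arrays
    print(rf_random.best_params_)
    print(rf_random.best_score_)

    Compared with the exhaustive grid in your GridSearchCV (tens of thousands of combinations), this keeps the search tractable while still exploring the same ranges.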

    Here is some sample code to do Cross Validation.

    # import cross_validate, the iris dataset, and a random forest classifier
    from sklearn.model_selection import cross_validate
    from sklearn import datasets
    from sklearn.ensemble import RandomForestClassifier

    # get the iris data
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # run 5-fold cross-validation and inspect the per-fold test scores
    model = RandomForestClassifier(random_state=1)
    cv = cross_validate(model, X, y, cv=5)
    print(cv)
    print(cv['test_score'])
    print(cv['test_score'].mean())
    

    Result:

    {'fit_time': array([0.18350697, 0.14461398, 0.14261866, 0.13116884, 0.15478826]), 'score_time': array([0.01496148, 0.00997281, 0.00897574, 0.00797844, 0.01396227]), 'test_score': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ])}
    [0.96666667 0.96666667 0.93333333 0.96666667 1.        ]
    0.9666666666666668
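
    Since your actual question is how to check for overfitting, one simple check is to ask cross_validate for the training scores as well (return_train_score=True) and compare them with the held-out fold scores: a training score that is much higher than the test score is the classic sign of overfitting. Here is a minimal sketch reusing model, X and y from the block above.

    # Request training scores so they can be compared with the test-fold scores
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)

    train_mean = cv['train_score'].mean()
    test_mean = cv['test_score'].mean()
    print('mean train score:', train_mean)
    print('mean test score: ', test_mean)
    print('gap (train - test):', train_mean - test_mean)

    In your own code you already pass return_train_score=True to GridSearchCV, so you can make the same comparison using search.cv_results_['mean_train_score'] and search.cv_results_['mean_test_score'] for the chosen parameter setting.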
    

    Inner Working of Cross Validation (a minimal hand-written version is sketched after this list):

    1. Shuffle the dataset to remove any ordering.
    2. Split the data into K folds; K = 5 or 10 works for most cases.
    3. Keep one fold for testing and use all the remaining folds for training.
    4. Train (fit) the model on the training folds, evaluate it on the held-out test fold, and record the score for that split.
    5. Repeat this process for every fold, choosing a different fold as the test data each time, so the model is trained and tested on different subsets of the data.
    6. At the end, average the scores from all the splits to get the mean score.
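
    To make those steps concrete, here is a minimal hand-written version of that loop using KFold, reusing X and y from the iris example above; it is only an illustration of what cross_validate does internally, not something you need in your pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=1)   # steps 1-2: shuffle and split
    scores = []

    for train_idx, test_idx in kf.split(X):                # step 5: loop over the folds
        # step 3: one fold held out for testing, the rest used for training
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # step 4: fit on the training folds, evaluate on the held-out fold
        model = RandomForestClassifier(random_state=1)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))

    # step 6: average the per-fold scores
    print(np.mean(scores))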