Tags: machine-learning, scikit-learn, overfitting-underfitting

scikit-learn cross-validation over-fitting or under-fitting


I'm using scikit-learn's cross-validation and get, for example, a mean score of 0.82 (r2 scorer). How can I tell whether I am over-fitting or under-fitting using scikit-learn functions?


Solution

  • Unfortunately, I can confirm that there is no built-in tool to compare train and test scores in a CV setup: the cross_val_score tool only reports test scores.

    You can set up your own loop with the train_test_split function as in Ando's answer, but you can also use any other CV scheme.
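    For the single-split route, a minimal sketch (make_regression and Ridge here are illustrative stand-ins, not part of the original question):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Synthetic regression data as a placeholder for your own X, y
    X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

    # Hold out 25% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = Ridge().fit(X_train, y_train)

    # Compare the r2 score on the training set vs. the held-out set
    print(model.score(X_train, y_train))
    print(model.score(X_test, y_test))
    ```

    A large gap between the two printed scores points toward overfitting; two low scores point toward underfitting.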

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import get_scorer
    
    # regressor, X and y are assumed to be defined already
    scorer = get_scorer('r2')
    cv = KFold(n_splits=5)
    train_scores, test_scores = [], []
    for train, test in cv.split(X):
        # fit on the training fold, then score on both the train and test folds
        regressor.fit(X[train], y[train])
        train_scores.append(scorer(regressor, X[train], y[train]))
        test_scores.append(scorer(regressor, X[test], y[test]))
    
    mean_train_score = np.mean(train_scores)
    mean_test_score = np.mean(test_scores)
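    As a side note, newer scikit-learn releases (0.19 and later) do ship a cross_validate helper that reports both train and test scores directly. A minimal sketch, using an illustrative Ridge model on synthetic data (both are stand-ins, not from the original question):

    ```python
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_validate

    # Placeholder data and estimator
    X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

    # return_train_score=True makes cross_validate report train scores too
    scores = cross_validate(Ridge(), X, y, scoring='r2', cv=5,
                            return_train_score=True)

    gap = np.mean(scores['train_score']) - np.mean(scores['test_score'])
    print(gap)  # a large positive gap suggests overfitting
    ```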
    

    If you compute the mean train and test scores with cross validation you can then find out if you are:

      - overfitting: the mean train score is significantly higher than the mean test score;
      - underfitting: both the mean train and the mean test scores are low.

    Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
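    To see the overfitting pattern concretely, here is a hypothetical illustration of the loop above with an unconstrained DecisionTreeRegressor on synthetic noisy data (all dataset and model choices are illustrative, not from the original question):

    ```python
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.metrics import get_scorer
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeRegressor

    # Noisy synthetic data: the noise is what an unpruned tree will memorize
    X, y = make_regression(n_samples=200, n_features=5, noise=30.0, random_state=0)

    scorer = get_scorer('r2')
    cv = KFold(n_splits=5)
    train_scores, test_scores = [], []
    regressor = DecisionTreeRegressor(random_state=0)  # no depth limit: prone to overfit
    for train, test in cv.split(X):
        regressor.fit(X[train], y[train])
        train_scores.append(scorer(regressor, X[train], y[train]))
        test_scores.append(scorer(regressor, X[test], y[test]))

    print(np.mean(train_scores))  # close to 1.0: the tree memorizes the training folds
    print(np.mean(test_scores))   # typically noticeably lower on the held-out folds
    ```

    The large train/test gap is the overfitting signature; with a model that is too simple for the data you would instead see both means come out low.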