python scikit-learn regression cross-validation

sklearn stratified k-fold CV with linear model like ElasticNetCV


Using cross-validation (CV) with sklearn is quite easy and straightforward. But when cv=5 is set in a linear CV model such as ElasticNetCV or LassoCV, the default implementation is a KFold CV. For various reasons I'd like to use a StratifiedKFold instead. From the documentation, it seems that any CV method can be passed via cv=.

Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises this error:

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

I know that I can use cross_val_score after fitting, but I'd like to pass StratifiedKFold as CV directly to the linear model.

My minimum working example is:

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)

# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y)  # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y)  # also works fine

# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y)  # THIS RAISES THE ERROR

Any idea how I can set StratifiedKFold as CV directly?


Solution

  • The root of your problem is this line:

    y = np.arange(100) + np.random.rand(100)
    

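    StratifiedKFold cannot stratify on a continuous target: it needs discrete class labels so that it can preserve the class proportions in each fold, hence your error. Change this line to produce a discrete target and your code runs without the error: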

    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import KFold, StratifiedKFold
    import numpy as np
    
    x = np.arange(100, dtype=np.float64).reshape(-1, 1)
    y = np.random.choice([0,1], size=100)
    
    # KFold default implementation:
    model_default = ElasticNetCV(cv=5)
    model_default.fit(x, y)  # works fine
    # KFold given as cv explicitly:
    model_kfexp = ElasticNetCV(cv=KFold(5))
    model_kfexp.fit(x, y)  # also works fine
    
    # StratifiedKFold given as cv explicitly:
    model_skf = ElasticNetCV(cv=StratifiedKFold(5))
    model_skf.fit(x, y)  # no ERROR
    

    NOTE

    If your target is continuous, use KFold. If your target is categorical, you may use either KFold or StratifiedKFold, whichever suits your needs.
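
To see which kind of target you have before choosing a splitter, you can inspect it with type_of_target from sklearn.utils.multiclass. The names in the error message ('binary', 'multiclass', 'continuous') match the strings this helper returns, so it is a reasonable way to check, though treating it as the exact check StratifiedKFold performs is an assumption on my part. A minimal sketch with a seeded generator (the variable names are only illustrative):

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Continuous target, built like the one in the question
y_continuous = np.arange(100) + rng.random(100)
# Discrete target, like the one in the working example
y_binary = rng.choice([0, 1], size=100)

kind_continuous = type_of_target(y_continuous)
kind_binary = type_of_target(y_binary)

print(kind_continuous)  # target type that StratifiedKFold rejects
print(kind_binary)      # target type that StratifiedKFold accepts
```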

    NOTE 2

    If you insist on emulating stratified sampling on continuous data, you can apply pandas.cut to bin your target, do stratified sampling on the binned labels, and finally pass the resulting (train_id, test_id) generator to the cv parameter:

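    import pandas as pd  # needed for pd.cut; the other imports are as above
    
    x = np.arange(100, dtype=np.float64).reshape(-1, 1)
    y = np.arange(100) + np.random.rand(100)
    
    # Bin the continuous target into 10 equal-width intervals
    y_cat = pd.cut(y, 10, labels=range(10))
    # Stratify on the bins; split() yields (train_id, test_id) index pairs
    skf_gen = StratifiedKFold(5).split(x, y_cat)
    
    model_skf = ElasticNetCV(cv=skf_gen)
    model_skf.fit(x, y)  # no ERROR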
    

    You can also use pandas.qcut for quantile-based discretization.
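
A rough sketch of the pandas.qcut variant, under a few assumptions of my own: 10 quantile bins, labels=False so that qcut returns plain integer bin codes, and the splits materialised as a list so they can be inspected and reused (a generator is exhausted after one pass). The names y_bins, splits and model_qskf are just illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + rng.random(100)

# Bin the continuous target into 10 equal-frequency bins (deciles).
# labels=False returns integer bin codes instead of a Categorical.
y_bins = pd.qcut(y, q=10, labels=False)

# Stratify on the bin codes and keep the (train_id, test_id) pairs in a list.
splits = list(StratifiedKFold(n_splits=5).split(x, y_bins))

# The model is still fitted on the original continuous target.
model_qskf = ElasticNetCV(cv=splits)
model_qskf.fit(x, y)

# Each test fold should draw the same number of samples from every bin.
for _, test_id in splits:
    print(np.bincount(y_bins[test_id], minlength=10))
```

Compared with pandas.cut, which uses equal-width bins, qcut gives bins with roughly equal counts, so no bin ends up with fewer members than the number of folds when the target is skewed.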