Tags: python, machine-learning, regression, cross-validation, grid-search

Hurdle models - GridSearchCV


I am trying to build a hurdle model (a zero-inflated regressor) to predict the revenue from each of our customers.

We use a zero-inflated regressor because most (80%) of our customers have zero revenue and only 20% have revenue > 0.

So we build two models, a classifier and a regressor, as shown below:

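# ZeroInflatedRegressor is assumed here to be the scikit-lego implementation
from sklego.meta import ZeroInflatedRegressor
from sklearn.ensemble import ExtraTreesClassifier, RandomForestRegressor

zir = ZeroInflatedRegressor(
    classifier=ExtraTreesClassifier(),
    regressor=RandomForestRegressor()
)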

I then use GridSearchCV to tune the model's hyperparameters:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    estimator=zir,
    param_grid={
        'classifier__n_estimators': [100, 200, 300, 400, 500],
        'classifier__bootstrap': [True, False],
        'classifier__max_features': ['sqrt', 'log2', None],
        'classifier__max_depth': [2, 4, 6, 8, None],
        'regressor__n_estimators': [100, 200, 300, 400, 500],
        'regressor__bootstrap': [True, False],
        'regressor__max_features': ['sqrt', 'log2', None],
        'regressor__max_depth': [2, 4, 6, 8, None],
    },
    scoring='neg_mean_squared_error',
)

My question is: how does GridSearchCV work in the case of hurdle models?

Are the classifier's hyperparameters combined with the regressor's to form joint candidates, or are only hyperparameters within the same model combined with each other?

Put simply, would the classifier have 150 combinations of hyperparameters and the regressor separately have another 150?


Solution

  • In your code snippet, there are 150 * 150 = 22,500 hyperparameter combinations to try. (You can check this easily by starting a fit with verbose=1; it prints the number of candidates and the total number of fits.) This is just how GridSearchCV works: it takes the Cartesian product of every list in param_grid. It is not anything specific to ZeroInflatedRegressor.

    If you want different behavior, you can wrap the individual estimators in grid searches. For example,

    # Each search wraps the estimator directly, so the parameter names
    # carry no 'classifier__' / 'regressor__' prefix here.
    clf = GridSearchCV(
        estimator=ExtraTreesClassifier(),
        param_grid={
            'n_estimators': [100, 200, 300, 400, 500],
            'bootstrap': [True, False],
            'max_features': ['sqrt', 'log2', None],
            'max_depth': [2, 4, 6, 8, None],
        },
        scoring='roc_auc',
    )

    reg = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'n_estimators': [100, 200, 300, 400, 500],
            'bootstrap': [True, False],
            'max_features': ['sqrt', 'log2', None],
            'max_depth': [2, 4, 6, 8, None],
        },
        scoring='neg_mean_squared_error',
    )
           
    zir = ZeroInflatedRegressor(
        classifier=clf,
        regressor=reg,
    )
    

    Now we need to know a bit more about how ZeroInflatedRegressor fits. It fits its classifier on all the data with the target "is it nonzero?"; in this case the classifier is a grid search, so it searches the 150 candidate hyperparameter combinations and keeps the one that performs best in terms of ROC AUC. Then it fits the regressor on the nonzero (predicted) data points only; that is again a search over 150 hyperparameter points, this time selecting for optimal MSE.

    So this version will be much faster (150 + 150 candidates instead of 150 * 150), in exchange for less optimality: you optimize the classifier for ROC AUC, not for how well it works together with the regressor's predictions to minimize the final MSE.
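
To make the candidate counts concrete, here is a small sketch that counts them with scikit-learn's ParameterGrid, without fitting anything. The names tree_grid, joint_grid, and cv are made up for illustration; the grid values are the ones from the question, and cv = 5 assumes GridSearchCV's default number of folds.

```python
from sklearn.model_selection import ParameterGrid

# Per-estimator grid, using the same values as in the question.
tree_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "bootstrap": [True, False],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [2, 4, 6, 8, None],
}

# Joint grid: the same keys, prefixed for both sub-estimators,
# which is what the single GridSearchCV in the question receives.
joint_grid = {
    f"{role}__{name}": values
    for role in ("classifier", "regressor")
    for name, values in tree_grid.items()
}

n_per_model = len(ParameterGrid(tree_grid))  # 5 * 2 * 3 * 5
n_joint = len(ParameterGrid(joint_grid))     # Cartesian product of all 8 lists
n_nested = 2 * n_per_model                   # one separate search per sub-estimator

cv = 5  # assumed: GridSearchCV's default number of folds
print("candidates per model:", n_per_model)
print("joint candidates:", n_joint, "-> fits:", n_joint * cv)
print("nested candidates:", n_nested, "-> fits:", n_nested * cv)
```

The fit counts exclude the final refit of each search on its full training data. In the nested version, the regressor's search also runs on fewer rows, since it only sees the nonzero subset.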