python, scikit-learn, data-science, xgboost, imbalanced-data

Sklearn Pipelines + GridSearchCV + XGBoost + Learning Curve


I am new to sklearn & XGBoost. I would like to use GridSearchCV to tune an XGBoost classifier. One of the checks I would like to do is a graphical analysis of the training and test loss. So far I have created the following code:

# Create a new instance of the classifier
xgbr = xgb.XGBClassifier()
# Create a new pipeline with preprocessing steps and model (imbalanced-learn)
pipeline = imb_pipeline([
                          ('preprocess', preprocess), # encode and transform categorical variables
                          ('re-sample', samplers[0]), # re-sample the data to a balanced state
                          ('scale', scalers[0]),      # scale the data
                          ('model', xgbr),            # model
                          ])

# Create parameter values for the grid search - careful: each key must be prefixed with "model__", matching the step name defined in the pipeline
params = { 
    'model__max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
    'model__learning_rate': [0.001, 0.01, 0.1, 0.20, 0.25, 0.30],
    "model__gamma":[0, 0.25, 0.5, 0.75,1],
    'model__n_estimators': [100, 500, 1000],
    "model__subsample":[0.9],
    "model__colsample_bytree":[0.5],
    "model__early_stopping_rounds": [10], 
    "model__random_state": [random_state], 
    "model__eval_metric" : ["error"], 
    "model__eval_set" : [[(X_train, Y_train), (X_test,Y_test)]]
}

# Use GridSearchCV for all combinations
grid = GridSearchCV(
    estimator = pipeline,
    param_grid = params,
    scoring = 'roc_auc',
    n_jobs = -1,
    cv = 5,
    verbose = 3,
)

# Model fitting
grid.fit(X_train, Y_train)

I have created key-value pairs in params for eval_metric and eval_set.
My question is now: how do I access those values and plot a curve of the train and test loss (sorry, I cannot post a figure here)? Another question: are the values handed over by eval_set also piped through the pipeline, or do I have to create a separate pipeline for those?

I am using xgb.__version__ == 0.90, sklearn.__version__ == 1.0.2, Python 3.7.13 (Google Colab).


Solution

  • I think you are misunderstanding how the grid search is coupled to the cross-validation. Your training set will be partitioned into 5 (cv = 5) almost even chunks; for each value of the hyperparameter grid it will train on 4 of them and predict (+ evaluate) on the remaining one, iterating over all possible splits. This gives a cross-validated estimate of the error (the per-split scores end up in cv_results_; see the sketch after the code below), but it all happens inside your training data. After picking the hyperparameters giving the best (= smallest) error, you want to evaluate this model (retrained on the entire training set) on the evaluation data, i.e. held-out data that is totally new to both the model and the hyperparameter selection. This gives you a reliable estimate of the generalization error of your model.

    Also if you use

        "model__eval_set" : [[(X_train, Y_train), (X_test,Y_test)]]
    

    the grid will treat it as just another hyperparameter value to optimize over, and you do not want that.

    All in all, you are looking for something like this:

    
    params = { 
        'model__max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
        'model__learning_rate': [0.001, 0.01, 0.1, 0.20, 0.25, 0.30],
        "model__gamma":[0, 0.25, 0.5, 0.75,1],
        'model__n_estimators': [100, 500, 1000],
        "model__subsample":[0.9],
        "model__colsample_bytree":[0.5],
        "model__early_stopping_rounds": [10], 
        "model__random_state": [random_state], 
    }
    
    grid = GridSearchCV(
        estimator = pipeline,
        param_grid = params,
        scoring = 'roc_auc',
        n_jobs = -1,
        cv = 5,
        verbose = 3,
    )
    
    # Model fitting - fit parameters such as eval_set have to be routed to the
    # pipeline step they belong to, hence the "model__" prefix
    grid.fit(X_train, Y_train, model__eval_set=[(X_test, Y_test)])
    
    # Evaluate the refitted best pipeline on the held-out test data
    import sklearn.metrics
    eval_auc = sklearn.metrics.roc_auc_score(Y_test, grid.best_estimator_.predict_proba(X_test)[:, 1])
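
    If you want to look at the cross-validated scores mentioned above, GridSearchCV stores them in cv_results_, one entry per hyperparameter combination with the per-split and mean test scores. A small sketch, assuming pandas is available:

    import pandas as pd
    
    # One row per hyperparameter combination; split0_test_score .. split4_test_score
    # hold the 5 per-fold roc_auc values, mean_test_score their average
    cv_table = pd.DataFrame(grid.cv_results_)
    cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
    print(cv_table.sort_values('rank_test_score')[cols].head())
    
    print(grid.best_params_)   # the winning combination
    print(grid.best_score_)    # its mean cross-validated roc_auc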
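
    As for plotting the train/test loss curve: fit parameters like eval_set are handed to the 'model' step as-is, so they are not run through the pipeline's preprocessing, re-sampling or scaling steps (which also answers the second question - no separate pipeline is needed, but you do have to transform that data yourself). Below is a minimal sketch of one way to get the curve, not part of the original answer; it assumes the step names from the pipeline above, that matplotlib is available, and xgboost 0.90's fit signature (where eval_metric is a fit parameter). The re-sampling step only acts during fit, so it is re-applied here to the training data only:

    import matplotlib.pyplot as plt
    from sklearn.base import clone
    
    best_pipe = grid.best_estimator_
    pre = best_pipe.named_steps['preprocess']
    sampler = best_pipe.named_steps['re-sample']
    scaler = best_pipe.named_steps['scale']
    
    # Reproduce the pipeline transformations manually, since eval_set bypasses them
    X_train_res, Y_train_res = sampler.fit_resample(pre.transform(X_train), Y_train)
    X_train_t = scaler.transform(X_train_res)            # training data: re-sampled, then scaled
    X_test_t = scaler.transform(pre.transform(X_test))   # test data: transformed, never re-sampled
    
    # Clone the tuned XGBoost model (same hyperparameters, unfitted) and refit it,
    # recording the metric on both sets after every boosting round
    model = clone(best_pipe.named_steps['model'])
    model.fit(X_train_t, Y_train_res,
              eval_set=[(X_train_t, Y_train_res), (X_test_t, Y_test)],
              eval_metric='logloss', verbose=False)
    
    results = model.evals_result()   # {'validation_0': {'logloss': [...]}, 'validation_1': {...}}
    rounds = range(len(results['validation_0']['logloss']))
    plt.plot(rounds, results['validation_0']['logloss'], label='train')
    plt.plot(rounds, results['validation_1']['logloss'], label='test')
    plt.xlabel('boosting round')
    plt.ylabel('logloss')
    plt.legend()
    plt.show()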