Tags: python, scikit-learn, h2o, automl

H2O sklearn wrapper: how to get H2OAutoML object out of it and run explain()?


I am using the H2O AutoML library from Python with the scikit-learn wrapper to create a pipeline for training my model. I am following this example, recommended by the official documentation:

from sklearn import datasets
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

from h2o.sklearn import H2OAutoMLClassifier


# Example data (a stand-in for the real X_classes, y_classes);
# pandas objects are used so the response keeps a name and .to_frame() works later
import pandas as pd
X_arr, y_arr = datasets.make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=2022)
X_classes = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])
y_classes = pd.Series(y_arr, name="class")

X_classes_train, X_classes_test, y_classes_train, y_classes_test = train_test_split(X_classes, y_classes, test_size=0.33)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('featselect', SelectKBest(f_classif, k=5)),
    ('classifier', H2OAutoMLClassifier(max_models=10, seed=2022, sort_metric='logloss'))
])

pipeline.fit(X_classes_train, y_classes_train)
preds = pipeline.predict(X_classes_test)

So I've trained my pipeline/model; now I want to get the H2OAutoML object out of the H2OAutoMLClassifier wrapper so I can invoke its .explain() method and get some insight into the features and models.

How do I do that?


Solution

  • There's no easy way to use .explain() on an sklearn pipeline. You can extract H2OAutoML's leader model (the best model trained in the AutoML run) and call .explain() on that.

    For .explain() to work you'll need an H2OFrame with the same features as were used to train the model, and that's the problem for both interpretability and ease of use. You will need to create that dataset using the first two steps of the pipeline (in your example polyfeat and featselect). This alone makes the result hard to interpret - the columns will get generic names like C1, C2, ...

    You can do the things described above with the following code:

    import h2o

    # The last pipeline step is the H2OAutoMLClassifier wrapper; its .estimator
    # attribute is the underlying H2OAutoML object and .leader is the best model
    aml = pipeline.steps[-1][1].estimator
    leader = aml.leader
    response_column = leader.actual_params["response_column"]

    # Transform the test data with every step except the final estimator
    transformed_df = X_classes_test
    for _, step in pipeline.steps[:-1]:
        transformed_df = step.transform(transformed_df)

    # Create the H2OFrame and give it the column names the leader was trained on
    h2o_frame = h2o.H2OFrame(transformed_df)
    h2o_frame.columns = [c for c in leader._model_json["output"]["names"]
                         if c != response_column]

    # Add the response column (assumes y_classes_test is a pandas Series)
    h2o_frame = h2o_frame.cbind(h2o.H2OFrame(y_classes_test.to_frame()))
    h2o_frame.set_name(h2o_frame.shape[1] - 1, response_column)

    # Run .explain() on the leader model
    leader.explain(h2o_frame)
    

    However, if you need interpretability and do not need to cross-validate the whole pipeline, I'd recommend another approach: use the first N-1 steps of the pipeline to create a data frame, add appropriate column names to it, and then run H2O AutoML through the native h2o API. This makes it easier to use .explain() and other interpretability-related methods, and you get column names with actual meaning rather than names based purely on column order. A rough sketch of this approach is shown below.
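
    A minimal sketch of that approach, assuming the same train/test split as in the question, a reasonably recent scikit-learn (for Pipeline.get_feature_names_out()), and a made-up response column name 'target', could look like this:

    import h2o
    import pandas as pd
    from h2o.automl import H2OAutoML
    from sklearn.feature_selection import f_classif, SelectKBest
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Connect to (or start) the local H2O cluster
    h2o.init()

    # Preprocessing-only pipeline: the first N-1 steps from the question
    preprocessing = Pipeline([
        ('polyfeat', PolynomialFeatures(degree=2)),
        ('featselect', SelectKBest(f_classif, k=5)),
    ])

    X_train_prep = preprocessing.fit_transform(X_classes_train, y_classes_train)
    X_test_prep = preprocessing.transform(X_classes_test)

    # Meaningful feature names instead of C1, C2, ...
    feature_names = list(preprocessing.get_feature_names_out())

    # Build pandas frames with the response attached, then convert to H2OFrames
    train_df = pd.DataFrame(X_train_prep, columns=feature_names)
    train_df['target'] = list(y_classes_train)
    test_df = pd.DataFrame(X_test_prep, columns=feature_names)
    test_df['target'] = list(y_classes_test)

    train_hf = h2o.H2OFrame(train_df)
    test_hf = h2o.H2OFrame(test_df)
    train_hf['target'] = train_hf['target'].asfactor()   # classification target
    test_hf['target'] = test_hf['target'].asfactor()

    # Run AutoML through the native h2o API
    aml = H2OAutoML(max_models=10, seed=2022, sort_metric='logloss')
    aml.train(y='target', training_frame=train_hf)

    # Explain the leader model on held-out data, now with readable column names
    aml.leader.explain(test_hf)

    The trade-off is that the preprocessing is fit once up front rather than inside AutoML's cross-validation, which is usually acceptable when the goal is interpretability.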