Tags: pandas, dataframe, h2o, isolation-forest

H2O | Extended Isolation Forest | model.explain() gives KeyError: 'response_column'


I have been struggling with this error for a few hours now and seem lost even after reading through the documentation.

I'm using H2O's Extended Isolation Forest (EIF), an unsupervised model, to detect anomalies in an unlabelled dataset. This works as intended; however, for the project I'm working on, model explainability is extremely important. I discovered the explain() function, which supposedly returns several explainability methods for a model. I'm particularly interested in the SHAP values from this function.

The documentation states

The main functions, h2o.explain() (global explanation) and h2o.explain_row() (local explanation) work for individual H2O models, as well as a list of models or an H2O AutoML object. The h2o.explain() function generates a list of explanations.

Since the H2O models link brings me to a page that covers both supervised and unsupervised models, I assumed the explain() function would work for both types of models.
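For context, this is roughly how that explain API is used on a supervised model; the frame and column names below (x1, x2, target) are made up purely for illustration:

import h2o
import numpy as np
import pandas as pd
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Toy supervised dataset (illustrative only)
train = h2o.H2OFrame(pd.DataFrame({
    "x1": np.random.rand(100),
    "x2": np.random.rand(100),
    "target": np.random.randint(0, 2, 100),
}))
train["target"] = train["target"].asfactor()

gbm = H2OGradientBoostingEstimator(ntrees=10)
gbm.train(x=["x1", "x2"], y="target", training_frame=train)

gbm.explain(train)                    # global explanations for a single model
gbm.explain_row(train, row_index=0)   # local explanation for one row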

The following code runs just fine:

import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator

h2o.init()

# Convert the pandas DataFrame to an H2OFrame and select the feature columns
df_EIF = h2o.H2OFrame(df_EIF)
predictors = df_EIF.columns[0:37]

# Extended Isolation Forest; extension_level = number of features - 1 gives the fully extended model
eif = H2OExtendedIsolationForestEstimator(ntrees=75, sample_size=500, extension_level=len(predictors) - 1)

eif.train(x=predictors, training_frame=df_EIF)
eif_result = eif.predict(df_EIF)
df_EIF['anomaly_score_EIF'] = eif_result['anomaly_score']

However, calling explain() on the model (eif)

eif.explain(df_EIF)

gives me the following KeyError:

KeyError                                  Traceback (most recent call last)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.py in <module>
----> 1 eif.explain(df_EIF)
      2 
      3 
      4 
      5 

C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in explain(models, frame, columns, top_n_features, include_explanations, exclude_explanations, plot_overrides, figsize, render, qualitative_colormap, sequential_colormap)
   2895     plt = get_matplotlib_pyplot(False, raise_if_not_available=True)
   2896     (is_aml, models_to_show, classification, multinomial_classification, multiple_models, targets,
-> 2897      tree_models_to_show, models_with_varimp) = _process_models_input(models, frame)
   2898 
   2899     if top_n_features < 0:

C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _process_models_input(models, frame)
   2802         models_with_varimp = [model for model in models if _has_varimp(model)]
   2803     tree_models_to_show = _get_tree_models(models, 1 if is_aml else float("inf"))
-> 2804     y = _get_xy(models_to_show[0])[1]
   2805     classification = frame[y].isfactor()[0]
   2806     multinomial_classification = classification and frame[y].nlevels()[0] > 2

C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _get_xy(model)
   1790     """
   1791     names = model._model_json["output"]["original_names"] or model._model_json["output"]["names"]
-> 1792     y = model.actual_params["response_column"]
   1793     not_x = [
   1794                 y,

KeyError: 'response_column'

From my understanding, this response column refers to the column you are trying to predict. However, since I'm dealing with an unlabelled dataset, this response column doesn't exist. Is there a way for me to bypass this error? Is it even possible to use the explain() function on unsupervised models? If so, how do I do this? If it is not possible, is there another way to extract the SHAP values of each variable from the model, given that shap.TreeExplainer also doesn't seem to work on an H2O model?

TL;DR: Is it possible to use the .explain() function from H2O on an (Extended) Isolation Forest? If so, how?


Solution

  • Unfortunately, the explain method in H2O-3 is supported only for supervised algorithms.

    What you can do is fit a surrogate model and look at the explanations for it. Basically, you'd fit a GBM (or DRF, as those two models support TreeSHAP) on the data plus the prediction of the Extended Isolation Forest, which would be the response; a sketch follows below.
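
    A minimal sketch of that surrogate approach, reusing df_EIF, predictors and the anomaly_score_EIF column from the question (the GBM settings are illustrative, not tuned):

    from h2o.estimators import H2OGradientBoostingEstimator

    # Supervised surrogate: predict the EIF anomaly score from the same features
    surrogate = H2OGradientBoostingEstimator(ntrees=100)
    surrogate.train(x=predictors, y="anomaly_score_EIF", training_frame=df_EIF)

    # The supervised explainability tooling now works on the surrogate
    surrogate.explain(df_EIF)

    # Per-row TreeSHAP values: one column per feature plus a BiasTerm column
    contributions = surrogate.predict_contributions(df_EIF)

    Keep in mind that the explanations then describe what drives the anomaly score produced by the EIF, not the EIF's internal structure itself, which is the usual trade-off with surrogate models.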