[SOLVED] How to handle class imbalance with H2O AutoML

I'm using H2O AutoML to do binary classification, and the classes are imbalanced.

I've set balance_classes = TRUE and max_after_balance_size = 100 in the h2o.automl() function to oversample the minority class. However, the leader model's area under the precision-recall curve (AUCPR) is still poor, about 0.10.

Are there any tips (e.g., preprocessing steps or parameter settings in h2o.automl()) for handling class imbalance with H2O AutoML?

Your kind guidance is much appreciated!

Solution

I would recommend specifying stopping_metric = "AUCPR" to optimize for AUCPR and sort_metric = "AUCPR" to let AutoML know that the leader model should be the one with the best AUCPR (otherwise it would use AUC by default).
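As a rough sketch, here is how those two settings could look in the Python API, alongside the balancing options from the question (the same argument names are used by h2o.automl() in R). The max_models and seed values, the helper name run_automl, and the file and column names in the usage comment are placeholders of mine, not anything from your setup.

```python
# AutoML settings discussed above.
# max_models and seed are illustrative assumptions, not recommendations.
AUTOML_SETTINGS = {
    "max_models": 20,
    "seed": 1,
    "balance_classes": True,
    "max_after_balance_size": 100,  # value taken from the question
    "stopping_metric": "AUCPR",     # early stopping watches AUCPR
    "sort_metric": "AUCPR",         # leaderboard (and leader) ranked by AUCPR
}


def run_automl(train_path, target):
    """Train H2O AutoML on a file and return the fitted AutoML object."""
    # Imported here so the settings can be inspected without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path)
    # The target must be categorical for classification.
    train[target] = train[target].asfactor()
    features = [c for c in train.columns if c != target]

    aml = H2OAutoML(**AUTOML_SETTINGS)
    aml.train(x=features, y=target, training_frame=train)
    return aml


# Usage (hypothetical file and column names):
# aml = run_automl("train.csv", "label")
# print(aml.leaderboard.head())
```

With sort_metric set this way, the first row of the leaderboard is the model with the best AUCPR rather than the best AUC.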

If your data is small enough, you might also be able to use libraries like imbalanced-learn in Python or themis in R to do some preprocessing, such as SMOTE or removing Tomek links, before handing the data to AutoML.
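For the Python route, a minimal sketch using imbalanced-learn's SMOTETomek, which applies SMOTE and then removes Tomek links. The helper names and the X_train / y_train variables in the usage comment are hypothetical, and this assumes numeric features that fit in memory.

```python
from collections import Counter


def class_counts(y):
    """Return a dict mapping each class label to its number of examples."""
    return dict(Counter(y))


def resample_smote_tomek(X, y, seed=0):
    """Oversample the minority class with SMOTE, then drop Tomek links."""
    # Imported here because it needs the imbalanced-learn package.
    from imblearn.combine import SMOTETomek

    sampler = SMOTETomek(random_state=seed)
    return sampler.fit_resample(X, y)


# Usage (X_train and y_train are hypothetical training arrays):
# X_res, y_res = resample_smote_tomek(X_train, y_train)
# print(class_counts(y_train), class_counts(y_res))
```

Resample only the training split and leave the validation and test data untouched, otherwise the reported AUCPR will not reflect the real class distribution.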