I am currently using H2O AutoML to train a model on a binary classification problem. I have train (70%, ~200k rows), valid (10%, ~30k rows), test (10%, ~30k rows) and blend (10%, ~30k rows) datasets, all produced by a time-sensitive split of the original dataset (~300k rows).
When I check the training confusion matrix, I see only ~10k total cases instead of ~200k.
I create the model like this:
from h2o.automl import H2OAutoML

# Create model
aml = H2OAutoML(
    max_runtime_secs=max_runtime_secs,
    stopping_metric=stopping_metric,  # "AUCPR"
    sort_metric=sort_metric,  # "AUCPR"
    nfolds=nfolds,  # set to 0
    distribution=distribution,  # "bernoulli"
    verbosity=verbosity,
    balance_classes=balance_classes,  # False
    seed=seed,
)
aml.train(
    y=outcome_column,
    training_frame=train,
    validation_frame=valid,
    leaderboard_frame=test,
    blending_frame=blend,
)
# Get the best model
best_model = aml.get_best_model()
# Get the performance on test
performance = best_model.model_performance(test)
# Define the threshold based on the desired metric (computed on validation data)
best_threshold = best_model.find_threshold_by_max_metric(
    metric=metric_to_use, valid=True)
# Inspect the confusion matrix on the training set using that threshold
train_confusion = best_model.confusion_matrix(
    thresholds=best_threshold, train=True)
# Inspect the confusion matrix on test using that threshold
test_confusion = performance.confusion_matrix(thresholds=best_threshold)
# Inspect the confusion matrix on validation using that threshold
valid_confusion = best_model.confusion_matrix(
    thresholds=best_threshold, valid=True)
These are the resulting confusion matrices:
confusion matrix train: Confusion Matrix (Act/Pred) @ threshold = 0.35701837501784456
       False  True  Error   Rate
-----  -----  ----  ------  --------------
False  8589   190   0.0216  (190.0/8779.0)
True   272    904   0.2313  (272.0/1176.0)
Total  8861   1094  0.0464  (462.0/9955.0)
confusion matrix valid: Confusion Matrix (Act/Pred) @ threshold = 0.3555305434918455
       False  True  Error   Rate
-----  -----  ----  ------  ----------------
False  23367  802   0.0332  (802.0/24169.0)
True   1486   1580  0.4847  (1486.0/3066.0)
Total  24853  2382  0.084   (2288.0/27235.0)
confusion matrix test: Confusion Matrix (Act/Pred) @ threshold = 0.3546996890950105
       False  True  Error   Rate
-----  -----  ----  ------  ----------------
False  23399  769   0.0318  (769.0/24168.0)
True   1537   1529  0.5013  (1537.0/3066.0)
Total  24936  2298  0.0847  (2306.0/27234.0)
The valid and test confusion matrices each contain the expected ~30k total cases, but the train confusion matrix contains only ~10k total cases instead of the initial ~200k rows. Why?
EDIT 1: Here is the leaderboard of the models:
LEADERBOARD:
model_id                                                 aucpr     auc       logloss   mean_per_class_error  rmse      mse        training_time_ms  predict_time_per_row_ms  algo
StackedEnsemble_BestOfFamily_1_AutoML_1_20230504_164001  0.632635  0.876718  0.226965  0.260764              0.253097  0.0640579  4397              0.041607                 StackedEnsemble
GBM_1_AutoML_1_20230504_164001                           0.632514  0.876067  0.237024  0.262188              0.254668  0.0648556  24116             0.038866                 GBM
StackedEnsemble_BestOfFamily_2_AutoML_1_20230504_164001  0.631139  0.87681   0.22705   0.262454              0.253345  0.0641838  1421              0.049998                 StackedEnsemble
GBM_4_AutoML_1_20230504_164001                           0.491181  0.810158  0.338646  0.308554              0.311599  0.0970939  638               0.001294                 GBM
GBM_3_AutoML_1_20230504_164001                           0.374094  0.762457  0.342545  0.353879              0.312943  0.0979334  642               0.001166                 GBM
DRF_1_AutoML_1_20230504_164001                           0.373471  0.755862  2.01974   0.290401              0.311465  0.0970105  1735              0.001758                 DRF
GBM_2_AutoML_1_20230504_164001                           0.355587  0.758635  0.343282  0.371797              0.3132    0.0980943  960               0.001194                 GBM
GLM_1_AutoML_1_20230504_164001                           0.330708  0.727998  0.312086  0.362253              0.298647  0.08919    23648             0.001599                 GLM
[8 rows x 10 columns]
DeepLearning and StackedEnsemble models have a parameter, score_training_samples, that defaults to 10,000. It speeds up training by computing the training metrics on a sample of the training frame rather than on every row. The rationale is that users generally don't care much about the training performance metrics, so an estimate from a sample is often sufficient. Your best model is a StackedEnsemble, which is why the training confusion matrix covers only ~10k rows.
To get the confusion matrix for the whole training frame, score the frame explicitly, the same way you already do for test: best_model.model_performance(train).confusion_matrix(thresholds=best_threshold). More details are in the documentation.
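As a minimal sketch of that suggestion, the helper below rescores a whole frame and returns its confusion matrix at a chosen threshold. The function name full_frame_confusion_matrix is hypothetical (it is not part of h2o). It assumes only that the model exposes model_performance(frame) and that the returned metrics object exposes confusion_matrix(thresholds=...), which is how the question's own code already handles the test frame.

```python
def full_frame_confusion_matrix(model, frame, threshold):
    """Score every row of `frame` and return the confusion matrix at `threshold`.

    Hypothetical helper, not an h2o function. It relies only on the two
    calls already used in the question for the test frame.
    """
    # model_performance(frame) computes fresh metrics on the given frame,
    # rather than reading the training metrics stored on the model, which
    # may have been computed on a sample.
    performance = model.model_performance(frame)
    return performance.confusion_matrix(thresholds=threshold)


# Usage with the names from the question (assumes a trained AutoML run):
# train_confusion = full_frame_confusion_matrix(best_model, train, best_threshold)
```

With this, the training totals should match the row count of train, because the metrics are recomputed on the frame you pass in rather than read from the sampled scores stored during training.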