Why does my training confusion matrix in h2o AutoML only show 10k total cases instead of 200k?

I am currently using h2o AutoML to train a model on a binary classification problem. I have train (70%, ~200k rows), valid (10%, ~30k rows), test (10%, ~30k rows) and blend (10%, ~30k rows) datasets, all produced by a time-based split of the original dataset (~300k rows).

When checking the training confusion matrix I only see 10k total cases instead of ~200k.

I create the model like this :

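from h2o.automl import H2OAutoML

# Create the AutoML run
aml = H2OAutoML(
    stopping_metric=stopping_metric,  # "AUCPR"
    sort_metric=sort_metric,  # "AUCPR"
    nfolds=nfolds,  # set to 0
    distribution=distribution,  # "bernoulli"
    balance_classes=balance_classes,  # False
)

# Train; x and y are placeholders for the predictor list and target column,
# and the frame roles are assumed from the split described above
aml.train(x=x, y=y, training_frame=train, validation_frame=valid, blending_frame=blend)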

# Get the best model
best_model = aml.get_best_model()

# get the performance on test
performance = best_model.model_performance(test)

# define the threshold based on the desired metric
best_threshold = best_model.find_threshold_by_max_metric(
        metric=metric_to_use, valid=True)

# inspect confusion matrix on training set using that threshold
train_confusion = best_model.confusion_matrix(
        thresholds=best_threshold, train=True)

# inspect confusion matrix on test using that threshold
test_confusion = performance.confusion_matrix(thresholds=best_threshold)

# confusion matrix validation using that threshold
valid_confusion = best_model.confusion_matrix(
    thresholds=best_threshold, valid=True)

These are the resulting confusion matrices:

confusion matrix train: Confusion Matrix (Act/Pred) @ threshold = 0.35701837501784456
       False    True    Error    Rate
-----  -------  ------  -------  --------------
False  8589     190     0.0216   (190.0/8779.0)
True   272      904     0.2313   (272.0/1176.0)
Total  8861     1094    0.0464   (462.0/9955.0) 

confusion matrix valid: Confusion Matrix (Act/Pred) @ threshold = 0.3555305434918455
       False    True    Error    Rate
-----  -------  ------  -------  ----------------
False  23367    802     0.0332   (802.0/24169.0)
True   1486     1580    0.4847   (1486.0/3066.0)
Total  24853    2382    0.084    (2288.0/27235.0) 

confusion matrix test: Confusion Matrix (Act/Pred) @ threshold = 0.3546996890950105
       False    True    Error    Rate
-----  -------  ------  -------  ----------------
False  23399    769     0.0318   (769.0/24168.0)
True   1537     1529    0.5013   (1537.0/3066.0)
Total  24936    2298    0.0847   (2306.0/27234.0) 

The valid and test confusion matrices show the expected ~30k total cases each, but the train confusion matrix shows only ~10k total cases instead of the ~200k training rows. Why?

EDIT 1: Here is the leaderboard of the models :

model_id                                                    aucpr       auc    logloss    mean_per_class_error      rmse        mse    training_time_ms    predict_time_per_row_ms  algo
StackedEnsemble_BestOfFamily_1_AutoML_1_20230504_164001  0.632635  0.876718   0.226965                0.260764  0.253097  0.0640579                4397                   0.041607  StackedEnsemble
GBM_1_AutoML_1_20230504_164001                           0.632514  0.876067   0.237024                0.262188  0.254668  0.0648556               24116                   0.038866  GBM
StackedEnsemble_BestOfFamily_2_AutoML_1_20230504_164001  0.631139  0.87681    0.22705                 0.262454  0.253345  0.0641838                1421                   0.049998  StackedEnsemble
GBM_4_AutoML_1_20230504_164001                           0.491181  0.810158   0.338646                0.308554  0.311599  0.0970939                 638                   0.001294  GBM
GBM_3_AutoML_1_20230504_164001                           0.374094  0.762457   0.342545                0.353879  0.312943  0.0979334                 642                   0.001166  GBM
DRF_1_AutoML_1_20230504_164001                           0.373471  0.755862   2.01974                 0.290401  0.311465  0.0970105                1735                   0.001758  DRF
GBM_2_AutoML_1_20230504_164001                           0.355587  0.758635   0.343282                0.371797  0.3132    0.0980943                 960                   0.001194  GBM
GLM_1_AutoML_1_20230504_164001                           0.330708  0.727998   0.312086                0.362253  0.298647  0.08919                 23648                   0.001599  GLM
[8 rows x 10 columns]


  • DeepLearning and StackedEnsemble models have a parameter, score_training_samples, that defaults to 10,000. It speeds up training by computing the training metrics on a sample of the training frame only. The rationale is that users generally care little about training performance metrics, so an estimate from a sample is usually sufficient. The best model on your leaderboard is a StackedEnsemble, which is why its training confusion matrix totals only ~10k cases (9,955).

    To get the confusion matrix for the whole training frame, score that frame explicitly, the same way you already do for test: best_model.model_performance(train).confusion_matrix(thresholds=best_threshold). More details are in the documentation.
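
As a minimal sketch of that suggestion, the helper below wraps the two calls already used in the question for the test set (model_performance and confusion_matrix) so they can be applied to any frame. The helper name confusion_matrix_on_frame is hypothetical, not part of h2o, and the function assumes it is handed a trained binomial H2O model and an H2OFrame that contains the response column.

```python
def confusion_matrix_on_frame(model, frame, threshold):
    """Score `frame` explicitly and return its confusion matrix at `threshold`.

    Hypothetical helper: `model` is assumed to be a trained binomial H2O model
    and `frame` an H2OFrame that includes the response column.
    """
    # model_performance computes fresh metrics on the frame passed in,
    # instead of reusing the metrics stored on the model at training time
    # (which may have been computed on a sample of the training rows).
    performance = model.model_performance(frame)
    return performance.confusion_matrix(thresholds=threshold)


# Usage with the names from the question (needs a running H2O cluster):
# train_confusion = confusion_matrix_on_frame(best_model, train, best_threshold)
# print(train_confusion)
```

The "Total" row of the resulting matrix should then be close to the number of rows in train rather than ~10k. Scoring ~200k rows takes longer than reading the stored training metrics, which is the trade-off the sampling default is meant to avoid.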