Tags: r, cross-validation, h2o

Cross-Validation Metrics for H2O


I'm having a hard time understanding why the output for various metrics on my models differs when I use h2o.

For example, if I use 'h2o.grid', the logloss in the grid table differs from the mean reported in model$cross_validation_metrics_summary, although it matches the value in model$cross_validation_metrics. What is the reasoning behind this difference? Which one should I report?

library(mlbench)
library(h2o)
data(Sonar)

h2o.init()
Sonarhex <- as.h2o(Sonar)
h2o.grid("gbm", grid_id = "gbm_grid_id0", x = c(1:50), y = 'Class',
         training_frame = Sonarhex,
         hyper_params = list(ntrees = 50, learn_rate = c(.1, .2, .3)),
         nfolds = 5, seed = 1234)

grid <- h2o.getGrid("gbm_grid_id0", sort_by = 'logloss')

first_model <- h2o.getModel(grid@model_ids[[1]])
first_model@model$cross_validation_metrics_summary
first_model@model$cross_validation_metrics

Solution

  • This inconsistency is a documented issue, explained here, and will be resolved in a future release. The model$cross_validation_metrics_summary metrics are the correct CV metrics. The metrics that appear in the grid table, or that come from utility functions such as h2o.logloss(model, xval = TRUE), are slightly different because they pool the CV holdout predictions and then compute the loss, instead of computing the loss separately on each of the K folds and averaging the fold values. This can lead to slight numerical differences.
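To see why the two aggregation orders disagree, here is a minimal R sketch (plain R, not H2O) that computes binomial logloss both ways on two made-up folds of unequal size; all numbers below are invented for illustration:

```r
# Two ways to aggregate cross-validation logloss.
# Per-fold predictions below are hypothetical, chosen only to show the effect.

logloss <- function(actual, pred) {
  -mean(actual * log(pred) + (1 - actual) * log(1 - pred))
}

# Hypothetical out-of-fold labels/predictions for two folds of unequal size
actual_f1 <- c(1, 0, 1);        pred_f1 <- c(0.9, 0.2, 0.7)
actual_f2 <- c(0, 1, 0, 1, 1);  pred_f2 <- c(0.1, 0.8, 0.3, 0.6, 0.9)

# (a) summary-style: logloss per fold, then the unweighted mean of fold values
mean_of_folds <- mean(c(logloss(actual_f1, pred_f1),
                        logloss(actual_f2, pred_f2)))

# (b) grid/utility-style: pool all holdout predictions, compute one logloss
pooled <- logloss(c(actual_f1, actual_f2), c(pred_f1, pred_f2))

mean_of_folds
pooled
```

Because the pooled loss is effectively a fold-size-weighted average, the two values agree only when every fold has the same size (and even then floating-point order can introduce tiny differences); with unequal folds, as here, they differ slightly.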