r h2o

h2o.glm does not match glm in R for linear regressions


I have been working with H2O.ai (version 3.10.3.6) in combination with R.

I am struggling to replicate the results from glm with h2o.glm. I would expect exactly the same result (evaluated, in this case, in terms of mean squared error), but I am seeing much worse accuracy with h2o. Since my model is Gaussian, I would expect both cases to be ordinary least squares (or maximum likelihood) regressions.

Here is my example:

train <- model.matrix(~., training_df)
test <- model.matrix(~., testing_df)

model1 <- glm(response ~., data=data.frame(train))
yhat1 <- predict(model1 , newdata=data.frame(test))
mse1 <- mean((testing_df$response - yhat1)^2) #5299.128

h2o_training <- as.h2o(train)[-1,]
h2o_testing <- as.h2o(test)[-1,]

model2 <- h2o.glm(x = 2:dim(h2o_training)[2], y = 1,
                  training_frame = h2o_training,
                  family = "gaussian", alpha = 0)

yhat2 <- h2o.predict(model2, h2o_testing)
yhat2 <- as.numeric(as.data.frame(yhat2)[,1])
mse2 <- mean((testing_df$response - yhat2)^2) #8791.334

The MSE is about 66% higher for the h2o model (8791.334 vs. 5299.128). Is my hypothesis that glm ≈ h2o.glm wrong? I will try to provide an example dataset as soon as possible (the training dataset is confidential and 350000 rows x 350 columns).

An extra question: for some reason, as.h2o adds an extra row full of NAs, so that h2o_training and h2o_testing have an additional row. Removing it (as I do here: as.h2o(train)[-1,]) before building the model does not affect the regression performance. There are no NA values passed to either glm or h2o.glm; i.e. the training matrices do not have NA values.


Solution

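  • There are a few arguments you need to set in order to get H2O's GLM to match R's GLM, since by default they do not fit the same model. In particular, H2O's GLM applies elastic-net regularization unless you set lambda = 0; alpha only controls the mix between the L1 and L2 penalties, so alpha = 0 on its own gives a ridge-penalized fit rather than an unpenalized one. Here is an example of what you need to set to get identical results: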

    library(h2o)
    h2o.init(nthreads = -1)
    
    path <- system.file("extdata", "prostate.csv", package = "h2o")
    train <- h2o.importFile(path)
    
    # Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
    x <- setdiff(colnames(train), c("ID", "DPROS", "DCAPS", "VOL"))
    
    # Train H2O GLM (designed to match R)
    h2o_glmfit <- h2o.glm(y = "VOL", 
                          x = x, 
                          training_frame = train, 
                          family = "gaussian",
                          lambda = 0,
                          remove_collinear_columns = TRUE,
                          compute_p_values = TRUE,
                          solver = "IRLSM")
    
    # Train an R GLM
    r_glmfit <- glm(VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON, 
                    data = as.data.frame(train)) 
    

    Here are the coefs (they match):

    > h2o.coef(h2o_glmfit)
      Intercept     CAPSULE         AGE 
    -4.35605671 -4.29056573  0.29789896 
           RACE         PSA     GLEASON 
     4.35567076  0.04945783 -0.51260829 
    
    > coef(r_glmfit)
    (Intercept)     CAPSULE         AGE 
    -4.35605671 -4.29056573  0.29789896 
           RACE         PSA     GLEASON 
     4.35567076  0.04945783 -0.51260829 
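
If you want to check this on held-out data, as in the question, something like the following should work. It is only a sketch: the 80/20 split, the seed, and the variable names (`prostate`, `h2o_fit`, `r_fit`, `mse_h2o`, `mse_r`) are my own choices for illustration, and it reuses the same prostate data and arguments as above.

```r
library(h2o)
h2o.init(nthreads = -1)

path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path)

# Illustrative 80/20 split (ratio and seed are arbitrary assumptions)
splits <- h2o.splitFrame(prostate, ratios = 0.8, seed = 1)
train <- splits[[1]]
test <- splits[[2]]

x <- c("CAPSULE", "AGE", "RACE", "PSA", "GLEASON")

# H2O GLM with regularization switched off (same arguments as above)
h2o_fit <- h2o.glm(y = "VOL", x = x, training_frame = train,
                   family = "gaussian", lambda = 0,
                   remove_collinear_columns = TRUE,
                   compute_p_values = TRUE, solver = "IRLSM")

# R GLM fitted on exactly the same training rows
train_df <- as.data.frame(train)
test_df <- as.data.frame(test)
r_fit <- glm(VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON, data = train_df)

# Predict on the same test rows with both models
yhat_h2o <- as.data.frame(h2o.predict(h2o_fit, test))$predict
yhat_r <- predict(r_fit, newdata = test_df)

# Test-set mean squared error for each model
mse_h2o <- mean((test_df$VOL - yhat_h2o)^2)
mse_r <- mean((test_df$VOL - yhat_r)^2)
c(h2o = mse_h2o, r = mse_r)
```

With matching coefficients, the two MSE values should agree up to numerical precision.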
    

    I've added a JIRA ticket to add this info to the docs.
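
As a rough illustration of why `lambda = 0` matters when `alpha = 0`: a ridge (L2) penalty with any positive weight pulls the coefficients away from the least-squares solution. The sketch below uses only base R on simulated data. The data, the `ridge_coef` helper, and the penalty value of 50 are made up for illustration; this is not how H2O solves the problem internally, and H2O scales and chooses its penalty differently.

```r
set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(n * 3), n, 3))  # design matrix with intercept column
beta <- c(2, 1, -1, 0.5)
y <- as.vector(X %*% beta + rnorm(n))

# Hypothetical helper: closed-form ridge solution, intercept left unpenalized
ridge_coef <- function(X, y, lambda) {
  P <- diag(ncol(X))
  P[1, 1] <- 0  # do not penalize the intercept
  as.vector(solve(crossprod(X) + lambda * P, crossprod(X, y)))
}

ols <- unname(coef(glm(y ~ X[, -1])))     # gaussian family is the glm default
ridge0 <- ridge_coef(X, y, lambda = 0)    # no penalty
ridge50 <- ridge_coef(X, y, lambda = 50)  # arbitrary positive penalty

all.equal(ols, ridge0)    # no penalty: agrees with glm up to numerical tolerance
max(abs(ols - ridge50))   # positive penalty: coefficients are visibly shrunk
```

The same thing happens to predictions and test-set MSE: a penalized fit is a different model from the one `glm` estimates, so the two are not expected to match until the penalty is turned off.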