[SOLVED] How does DAI handle new (unseen in training) categorical values within a production environment?

I would like confirmation that DAI handles categorical values it did not encounter during training in a way similar to what is described in the answer to "h2o DRF unseen categorical values handling". I could not find this stated explicitly in the H2O Driverless AI documentation.

Please also state whether parts of that link are outdated (as mentioned in the answer) and, if DAI handles this differently, how it is processed. Please note the version of H2O DAI. Thank you!

Solution

EDIT: This information is now detailed in the documentation.

Below is a description of what happens when you predict on a categorical level that was not seen during training. Depending on your version of DAI, some of these algorithms may not be available, but for any algorithm you do have, the details should apply to your version.

XGBoost, LightGBM, RuleFit, TensorFlow, GLM

Driverless AI's feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it's a previously seen value or not. For example, frequency encoding replaces unseen levels with 0, and target encoding uses the global mean of the target value.
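
To make the fallback behaviour concrete, here is a minimal sketch in plain Python. It is not DAI's implementation: the function names and the toy data are made up for illustration, and a real pipeline would typically add smoothing or out-of-fold estimation for target encoding, which is left out here. It only shows the two fallbacks described above (0 for frequency encoding, global target mean for target encoding).

```python
from collections import Counter, defaultdict

def fit_frequency_encoder(values):
    """Map each level seen in training to its count (hypothetical helper)."""
    return dict(Counter(values))

def fit_target_encoder(values, targets):
    """Map each seen level to its mean target and return the global mean too."""
    sums = defaultdict(float)
    counts = Counter()
    for level, target in zip(values, targets):
        sums[level] += target
        counts[level] += 1
    means = {level: sums[level] / counts[level] for level in counts}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

# Toy training data (illustrative only).
train_levels = ["red", "red", "blue", "green"]
train_target = [1, 1, 1, 0]

freq = fit_frequency_encoder(train_levels)
target_means, global_mean = fit_target_encoder(train_levels, train_target)

def encode(level):
    # Unseen levels fall back to 0 (frequency) and the global mean (target).
    return freq.get(level, 0), target_means.get(level, global_mean)

print(encode("red"))     # level seen in training
print(encode("purple"))  # level never seen in training
```

Either way the downstream model receives a number, so prediction does not fail on an unseen level. Whether that number is informative is a separate question.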

and

FTRL

The FTRL model doesn't distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it will hash all the data, row by row, to numeric values and then make predictions. Since you can think of FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable "overlap", in terms of unique values, with the data used to make predictions.
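
The hashing idea can be sketched as follows. This is only an illustration of the general hashing trick, not DAI's FTRL code: the hash function (MD5), the bucket count, and the helper names are all assumptions chosen for the example. It shows why an unseen value still gets a numeric index, and why that index usually points at a bucket the model never learned anything about.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # hypothetical size of the hash space

def hash_feature(column, value):
    """Deterministically map a (column, value) pair to a bucket index."""
    key = f"{column}={value}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_BUCKETS

def hash_row(row):
    """Hash every field of a row the same way, whatever its original type."""
    return [hash_feature(col, val) for col, val in row.items()]

# Toy training rows (illustrative only).
train_rows = [
    {"color": "red", "size": 3},
    {"color": "blue", "size": 5},
]

# Buckets for which a linear model could have learned a weight.
seen_buckets = {bucket for row in train_rows for bucket in hash_row(row)}

# "purple" never appeared in training; size 3 did.
new_row = {"color": "purple", "size": 3}
for (col, val), bucket in zip(new_row.items(), hash_row(new_row)):
    status = "seen in training" if bucket in seen_buckets else "not seen in training"
    print(f"{col}={val} -> bucket {bucket} ({status})")
```

Barring a hash collision, the bucket for the unseen value carries no learned weight, which is why the overlap between training and scoring data matters so much for this kind of model.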

Since DAI uses different algorithms than H2O-3 (except for XGBoost), it's best to consider these as separate products with potentially different handling of unseen levels or missing values - though in some cases there are similarities.

As mentioned in the comment, the DRF documentation for H2O-3 should be up to date now.

Hope this explanation helps!