Often I'm unsure to what extent I should preprocess my data when using DAI. For a production-level model you typically want to reduce dimensionality, remove duplicate features, standardize/normalize, etc. Is there a rule of thumb for where I should stop my own preprocessing in favor of DAI (e.g., only remove NaNs for a binary classification problem and DAI will do the rest)? Will it explicitly state which normalization technique it used, like Sklearn's MinMaxScaler() for example?
Generally, no preprocessing is needed; the methods DAI uses for internal preprocessing depend on the algorithms behind the models.
However, there are specific use cases that do require preprocessing, and H2O can assist you with that if you contact them. For example, if you want to predict something at the customer level but your data is at the transaction level, you need to preprocess first. Say you have grocery store transactions and you want to predict how much a store will make tomorrow: then you need to aggregate to the day-store level, since that is the level you want predictions at. In general, any case where the data is more granular than the level you want predictions at requires preprocessing.
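The aggregation step described above can be sketched in pandas; the column names and values here are illustrative, not part of DAI itself:

```python
import pandas as pd

# Hypothetical transaction-level data (store_id, date, amount are
# made-up column names for illustration).
transactions = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 1],
    "date": ["2023-01-01", "2023-01-01", "2023-01-01",
             "2023-01-02", "2023-01-02"],
    "amount": [10.0, 5.0, 7.5, 3.0, 20.0],
})

# Roll up to the day-store level: one row per (store, day),
# matching the level you want predictions at.
daily = (
    transactions
    .groupby(["store_id", "date"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "daily_revenue"})
)

print(daily)
```

The aggregated `daily` frame (one row per store per day) is what you would then feed into DAI as the training dataset.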
For missing values, it's best to let Driverless AI handle them unless you know why the values are missing and can therefore use domain rules to fill them in. For example, if you have transaction = NA but you know that means no money was spent, you'd want to change the NA to 0.
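A minimal sketch of that domain-rule fill, assuming a hypothetical `transaction` column where NA is known to mean "no money spent":

```python
import pandas as pd

# Illustrative data; "transaction" is a made-up column name.
df = pd.DataFrame({"transaction": [12.5, None, 3.0, None]})

# Domain rule: NA means no money was spent, so fill with 0.
# Columns whose NAs have no known meaning are left for DAI to handle.
df["transaction"] = df["transaction"].fillna(0)

print(df["transaction"].tolist())
```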
I think the following docs may be helpful: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/faq.html#data-experiments-predictions. In particular the sections 'Can Driverless AI handle data with missing values/nulls?' and 'Does Driverless AI standardize the data?'.
You can also find a lot of information about what your experiment is doing in the experiment report: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/experiment-summary.html. We don't currently report the methods of standardization because it happens differently for each model in an ensemble that is potentially quite complex.