Tags: machine-learning, h2o, sklearn-pandas, predictive, h2o4gpu

H2O autoencoder anomaly detection only accepting numerical predictors


I am using the H2O autoencoder for anomaly detection to find outliers in my data, but the autoencoder only accepts numerical predictors. My requirement is to find outliers based on the card number or the merchant number. The card number is a 12-digit value (e.g. 342178901244) and is mostly unique, so it is nominal data. One-hot encoding is not practical either, because it would create as many new fields as there are unique card numbers. Is there a way to include this kind of categorical data and still run the autoencoder?

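import numpy as np
from h2o.estimators import H2OAutoEncoderEstimator

model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[70],
                                ignore_const_cols=False,
                                epochs=40)

# Python parses `train.hex` as attribute access, so the H2OFrames
# are named train_hex / test_hex here.
model.train(x=predictors, training_frame=train_hex)

# Get anomalous values: per-feature squared reconstruction error for each row
test_rec_error = model.anomaly(test_hex, per_feature=True)
train_rec_error = model.anomaly(train_hex, per_feature=True)

# recon_error_df and top_whisker are defined elsewhere (not shown). The
# 'Reconstruction.MSE' column is what anomaly() returns without per_feature=True.
recon_error_df['outlier'] = np.where(recon_error_df['Reconstruction.MSE'] > top_whisker, 'outlier', 'no_outlier')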

Solution

  • You can't feed an almost-unique categorical feature into a predictive model (an autoencoder or anything else) and expect it to work: with nearly one distinct value per row, there is no pattern for the model to learn from.

    Instead, you need to extract meaningful features from it, and which ones depends on the problem you want to solve. For example, if it is a credit card number, you could add a feature encoding the card network (VISA, Mastercard, American Express, ...).
    The only limits are your imagination and your knowledge of the domain.
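
As a rough illustration of that idea, here is a minimal pandas sketch. It assumes the card number is stored as a string and that the well-known leading-digit ranges (4 for VISA, 34/37 for American Express, 51-55 and 2221-2720 for Mastercard) are good enough for a coarse label. The helper card_network, the sample DataFrame, and the column names Amount, card_tx_count and amount_vs_card_mean are all hypothetical, made up for this example. The per-card aggregates are just one possible choice of domain feature, not something prescribed above.

```python
import pandas as pd


def card_network(card_number: str) -> str:
    """Map a card number to a coarse network label from its leading digits."""
    s = str(card_number)
    if s.startswith("4"):
        return "visa"
    if s[:2] in {"34", "37"}:
        return "amex"
    if s[:2].isdigit() and 51 <= int(s[:2]) <= 55:
        return "mastercard"
    if s[:4].isdigit() and 2221 <= int(s[:4]) <= 2720:
        return "mastercard"
    return "other"


# Hypothetical transactions; values and column names are illustrative only.
tx = pd.DataFrame({
    "CardNumber": ["342178901244", "342178901244", "411111111111", "550000000000"],
    "Amount": [20.0, 25.0, 300.0, 12.5],
})

# Low-cardinality categorical derived from the card number.
tx["card_network"] = tx["CardNumber"].map(card_network)

# Numeric per-card behaviour features: how often the card appears, and how
# each amount compares with that card's average amount.
per_card = tx.groupby("CardNumber")
tx["card_tx_count"] = per_card["CardNumber"].transform("count")
tx["amount_vs_card_mean"] = tx["Amount"] / per_card["Amount"].transform("mean")

# Drop the raw identifier before handing the data to the model.
features = tx.drop(columns=["CardNumber"])
```

The resulting features frame has one categorical column with only a handful of levels plus numeric columns, so the raw, almost-unique card number never reaches the autoencoder.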