h2o driverless-ai

Is it possible to define how many variables to use for the final model in H2O Driverless AI?


I am currently exploring the functionality of H2O Driverless AI (DAI). I understand that DAI can choose which variables to use and which transformers to apply to them during the feature selection/engineering phase. But is there a way to configure DAI to limit the maximum number of features it can use out of the provided list? For example, given 100 features, I want DAI to select only 20 of them and apply feature engineering to those. I browsed through the user manual but have not found any hints on this so far.

Many thanks in advance.


Solution

  • There are a few options to control the number of features used:

    # Maximum number of columns selected out of original set of original columns, using feature selection
    # The selection is based upon how well target encoding (or frequency encoding if not available) on categoricals and numerics treated as categoricals
    # This is useful to reduce the final model complexity. First the best
    # [max_orig_cols_selected] are found through feature selection methods and then
    # these features are used in feature evolution (to derive other features) and in modelling.
    #max_orig_cols_selected = 10000
    
    # Maximum number of numeric columns selected, above which will do feature selection
    # same as above (max_orig_cols_selected) but for numeric columns.
    #max_orig_numeric_cols_selected = 10000
    
    # Maximum number of non-numeric columns selected, above which will do feature selection on all features and avoid treating numerical as categorical
    # same as above (max_orig_numeric_cols_selected) but for categorical columns.
    #max_orig_nonnumeric_cols_selected = 300
    
    # Like max_orig_cols_selected, but columns above which add special individual with original columns reduced.
    # 
    #fs_orig_cols_selected = 500
    
    # Maximum features per model (and each model within the final model if ensemble) kept.
    # Keeps top variable importance features, prunes rest away, after each scoring.
    # Final ensemble will exclude any pruned-away features and only train on kept features,
    # but may contain a few new features due to fitting on different data view (e.g. new clusters)
    # Final scoring pipeline will exclude any pruned-away features,
    # but may contain a few new features due to fitting on different data view (e.g. new clusters)
    # -1 means no restrictions except internally-determined memory and interpretability restrictions.
    # Notes:
    # * If interpretability > remove_scored_0gain_genes_in_postprocessing_above_interpretability, then
    # every GA iteration post-processes features down to this value just after scoring them.  Otherwise,
    # only mutations of scored individuals will be pruned (until the final model where limits are strictly applied).
    # * If ngenes_max is not also limited, then some individuals will have more genes and features until
    # pruned by mutation or by preparation for final model.
    # * E.g. to generally limit every iteration to exactly 1 features, one must set nfeatures_max=ngenes_max=1
    # and remove_scored_0gain_genes_in_postprocessing_above_interpretability=0, but the genetic algorithm
    # will have a harder time finding good features.
    # 
    #nfeatures_max = -1
    

    These settings can be found in the config.toml file or in the expert settings.

    Note that you cannot control whether transformers are applied to specific features.
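
    For the case in the question (100 input features, at most 20 selected before feature engineering), a minimal sketch of the overrides might look like the fragment below. It assumes that a setting is overridden by uncommenting it (removing the leading `#`) and changing its value; the value 20 and the commented-out cap of 50 are illustrative and not taken from the documentation.

    ```toml
    # Keep at most 20 of the original columns; per the description above,
    # these are then used for feature evolution and modelling.
    max_orig_cols_selected = 20

    # Optional (illustrative value): also cap the number of features,
    # including engineered ones, kept per model.
    # -1, the default, means no explicit limit.
    #nfeatures_max = 50
    ```

    Because `max_orig_cols_selected` limits only the original columns, the final model may still contain more than 20 features once derived features are added; `nfeatures_max` is the setting that limits the features kept per model.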