python scikit-learn isolation-forest

How to give more importance to some features in sklearn Isolation Forest


I am using sklearn's Isolation Forest for an anomaly detection task. An isolation forest consists of iTrees. As this paper describes, each node of an iTree is split as follows: a feature is selected uniformly at random, and the split is made at a random value of that feature.

But I want to give more weight to some features than to others. Instead of selecting features with equal probability, I want to draw some features with a higher probability (giving them more weight) and other features with a lower probability.

How can I do that? From the source code it seems I would have to change the function _generate_bagging_indices in _bagging.py, but I am not sure.


Solution

  • You can achieve this without changing the source code by duplicating, in your input data, the features whose weight you want to increase. Each split draws a feature uniformly at random from the available columns, so a feature that appears twice is twice as likely to be drawn as a feature that appears once. In practice this has the same effect as doubling the weight of that feature.

    In addition, you can reduce the number of features that each tree of the isolation forest sees. This is controlled by the max_features argument. The default value of 1.0 makes every feature available to every tree. With a lower value, each tree is trained on a random subset of the columns, so more trees are built without the features that appear only once in your input.

    Illustration

    Load Data

    from sklearn.ensemble import IsolationForest
    import pandas as pd
    from sklearn.datasets import load_iris
    import matplotlib.pyplot as plt
    
    data = load_iris()
    X = data.data
    df = pd.DataFrame(X, columns=data.feature_names)
    

    Default settings

    IF = IsolationForest()
    IF.fit(df)
    preds = IF.predict(df)
    
    plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
    plt.title("Default settings")
    plt.xlabel("sepal length (cm)")
    plt.ylabel("sepal width (cm)")
    plt.show()
    

    [Figure: sepal length vs. sepal width, points colored by prediction, default settings]

    Weighted Settings

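    df1 = df.copy()
    # Add 10 extra copies of "sepal length (cm)": it then fills 11 of the 14 columns
    weight_feature = 10
    for i in range(weight_feature):
        df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
    
    # max_features < 1.0: each tree is trained on a random subset of the columns
    IF1 = IsolationForest(max_features=0.3)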
    IF1.fit(df1)
    preds1 = IF1.predict(df1)
    
    plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
    plt.title("Weighted settings")
    plt.xlabel("sepal length (cm)")
    plt.ylabel("sepal width (cm)")
    plt.show()
    

    [Figure: sepal length vs. sepal width, points colored by prediction, weighted settings]

    As the plots show, the weighted model relies more heavily on the x-axis (sepal length) to decide which points are outliers.
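To make the arithmetic behind the duplication trick explicit, here is a minimal sketch that turns integer weights into a list of repeated column names and computes the chance that a uniformly drawn column is a copy of each original feature. The helper names expand_columns and split_probabilities, and the weights dictionary, are hypothetical and only serve the illustration. They are not part of sklearn.

```python
from fractions import Fraction


def expand_columns(weights):
    """Repeat each column name weights[name] times (weights are positive ints)."""
    columns = []
    for name, copies in weights.items():
        if not isinstance(copies, int) or copies < 1:
            raise ValueError(f"weight for {name!r} must be a positive integer")
        columns.extend([name] * copies)
    return columns


def split_probabilities(weights):
    """Chance that a uniformly drawn column is a copy of each original feature."""
    total = sum(weights.values())
    return {name: Fraction(copies, total) for name, copies in weights.items()}


# Hypothetical weights mirroring the iris example above:
# the original "sepal length (cm)" column plus 10 duplicates gives 11 copies.
weights = {
    "sepal length (cm)": 11,
    "sepal width (cm)": 1,
    "petal length (cm)": 1,
    "petal width (cm)": 1,
}

columns = expand_columns(weights)
probs = split_probabilities(weights)

print(len(columns))                # 14
print(probs["sepal length (cm)"])  # 11/14
print(probs["sepal width (cm)"])   # 1/14
```

With a pandas DataFrame, something like df[columns].to_numpy() should produce the expanded input matrix to pass to IsolationForest. Note that only integer ratios can be expressed this way, and the probabilities describe a draw over all columns. With max_features below 1.0 each tree sees a random subset, so the ratio holds on average rather than in every tree.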