python · algorithm · machine-learning · scikit-learn · anomaly-detection

Isolation Forest in sklearn for a 1D array or list, and how to tune hyperparameters


Is there a way to use sklearn's Isolation Forest on a 1D array or list? All the examples I have come across use data with two or more dimensions.

I have currently developed a model with three features; an example code snippet is shown below:

# dataframe of three columns
df_data = datafr[['col_A', 'col_B', 'col_C']]
w_train = df_data[:700]
w_test = df_data[700:-2]

from sklearn.ensemble import IsolationForest
# fit the model
clf = IsolationForest(max_samples='auto')
clf.fit(w_train)

# testing it using the test set
y_pred_test = clf.predict(w_test)

The reference I mainly relied upon: IsolationForest example | scikit-learn

The df_data is a data frame with three columns. What I am actually looking for is a way to find outliers in one-dimensional (list) data.

The other question is how to tune an Isolation Forest model. One way is to increase the contamination value to reduce false positives, but how should the other parameters, such as n_estimators, max_samples, max_features, verbose, etc., be used?


Solution

  • Isolation Forest won't accept a bare 1D array or list directly: scikit-learn estimators expect a 2D array of shape (n_samples, n_features), so a single feature has to be reshaped into one column. Also note that with only one feature the trees can only split on that one value, so the model reduces to thresholding that feature.
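If you still want to run it on a single feature, a minimal sketch (the data values here are made up for illustration) is to reshape the list into a single-column 2D array:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One-dimensional input: sklearn estimators want shape (n_samples, n_features),
# so reshape the list into a single column first.
data = [2.1, 2.0, 1.9, 2.2, 2.05, 9.5, 2.1, 1.95]  # 9.5 is an obvious outlier
X = np.array(data).reshape(-1, 1)

clf = IsolationForest(contamination=0.1, random_state=0)
labels = clf.fit_predict(X)  # +1 = inlier, -1 = outlier
print(labels)
```

With a contamination of 0.1 the threshold is set so that roughly 10% of the training points fall below it, and the far-out value at 9.5 is the one flagged.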

    You can read the official documentation to get a better idea of what the different parameters do.

    contamination: "The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function."

    Try experimenting with different values in the range [0, 0.5] to see which one gives the best results.
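To see what contamination does in practice, here is a small sketch on synthetic data (the values are arbitrary): raising it lowers the decision threshold, so more training points get flagged as -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 3))  # synthetic stand-in for the three-column frame

# contamination sets the threshold on the decision function, so on the
# training data roughly that fraction of points ends up flagged as -1.
counts = []
for c in [0.05, 0.1, 0.2]:
    clf = IsolationForest(contamination=c, random_state=0).fit(X)
    counts.append(int((clf.predict(X) == -1).sum()))
print(counts)
```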

    max_features: "The number of features to draw from X to train each base estimator."

    Try different values (an int no larger than the number of features, or a float fraction of them) and validate against the final test data.
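A quick sketch of how max_features is interpreted (the fitted model's estimators_features_ attribute records which columns each tree was trained on; the data here is synthetic): an int is taken as a column count, a float as a fraction of the columns.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))

# max_features may be an int (a count of columns) or a float (a fraction);
# estimators_features_ records the columns each base tree was trained on.
results = {}
for mf in [2, 5, 1.0]:
    clf = IsolationForest(max_features=mf, random_state=0).fit(X)
    results[mf] = len(clf.estimators_features_[0])
print(results)
```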

    You can also use GridSearchCV to automate this parameter search: try different value combinations and see which one gives the best results.

    Try this:

    from sklearn.ensemble import IsolationForest
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import f1_score, make_scorer

    my_scoring_func = make_scorer(f1_score)
    parameters = {'n_estimators': [10, 30, 50, 80],
                  'max_features': [0.1, 0.2, 0.3, 0.4],
                  'contamination': [0.1, 0.2, 0.3]}
    iso_for = IsolationForest(max_samples='auto')
    clf = GridSearchCV(iso_for, parameters, scoring=my_scoring_func)
    

    Then use clf to fit the data. Note, however, that GridSearchCV requires both X and y (i.e. training data and labels) for its fit method.
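For example, with synthetic labeled data in Isolation Forest's +1/-1 convention (the dataset, grid values, and cv choice below are all illustrative, not prescriptive):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
# 200 tight inliers plus 20 spread-out outliers, 10 features each
X = np.vstack([rng.normal(0, 1, size=(200, 10)),
               rng.uniform(-8, 8, size=(20, 10))])
# labels in Isolation Forest's convention: +1 = inlier, -1 = outlier
y = np.concatenate([np.ones(200), -np.ones(20)])

param_grid = {'n_estimators': [50, 100], 'contamination': [0.05, 0.1, 0.2]}
search = GridSearchCV(IsolationForest(max_samples='auto', random_state=42),
                      param_grid,
                      scoring=make_scorer(f1_score),  # f1 on the +1 (inlier) class
                      cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

IsolationForest ignores y during fitting (it is unsupervised), but GridSearchCV still needs the labels to score each candidate's predictions.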

    Note: you can read this blog post for further reference if you wish to use GridSearchCV with Isolation Forest; otherwise you can manually try different values and plot graphs to compare the results.