Tags: python, scikit-learn, gridsearchcv

Using GridSearchCV for kmeans for an outlier detection problem


I want to write a thesis about outlier detection and compare a few outlier detection methods in a small experiment. According to this [paper](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2022_schmidl_anomaly.pdf), k-means is one possibility that achieved acceptable results. However, k-means was not originally designed as an outlier detection algorithm. K-means has a parameter k (the number of clusters), which can and should be optimised. For this I want to use sklearn's `GridSearchCV`. I assume that I know which data points are outliers.

I wrote a method that calculates the distance of each data point to its centroid; if that distance is more than 1.5 times the standard deviation, I consider the point an outlier and mark it as such.

Later I compare this result with my ground truth and calculate metrics such as F1, recall, accuracy, precision and ROC-AUC.

If I call GridSearchCV like this

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

points = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 50], [6, 6], [7, 7],
                   [8, 8], [9, 9], [10, 10], [11, 11], [12, 70], [13, 13],
                   [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19],
                   [20, 20]])
outlier_gt = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# build a lag feature and drop the first row (NaN after the shift)
df = pd.DataFrame(data=points, columns=['time', 'value'])
df['value2'] = df['value'].shift(1)
df2 = df.loc[:, ['value', 'value2']].dropna()
pointsnew = df2.to_numpy()

clustering = KMeans()
param_grid = {"n_clusters": range(1, 7)}
grid = GridSearchCV(clustering, param_grid=param_grid)
res = grid.fit(pointsnew, outlier_gt)
print(res.cv_results_)
print(grid.best_params_)

the results don't look very good (it tells me k=1 is best, although k=2 or k=4 work better). I guess the problem is that the fit method returns cluster labels (1, 2, 3, 4, etc.), while my ground truth `outlier_gt` only contains the labels 0 (no outlier) and 1 (outlier). So the score calculations don't make much sense, and I should write my own scoring function, which could use the distances to the centroids as in the following methods:

from scipy.spatial import distance

def getDistanceToCentroidKMeans(points, centroids, labels):
    # distance of each point to the centroid of its assigned cluster
    res = np.empty(labels.shape)
    for counter, point in enumerate(points):
        centroid = centroids[labels[counter]]
        res[counter] = distance.euclidean(point, centroid)
    return res
    # note: kmeans.transform(points) also yields distances to all centroids

def getOutlierLabels(distances):
    # flag points whose distance to their centroid deviates by more
    # than 1.5 standard deviations from the mean distance
    average = np.mean(distances)
    std = np.std(distances)
    upperLimit = average + 1.5 * std
    lowerLimit = average - 1.5 * std
    print('upperLimit: ' + str(upperLimit))
    print('lowerLimit: ' + str(lowerLimit))
    res = np.empty(distances.shape, dtype=int)
    for counter, dist in enumerate(distances):
        res[counter] = 1 if (dist > upperLimit or dist < lowerLimit) else 0
    return res
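As a side note, the same per-point distances can be obtained without a manual loop: a minimal sketch (the two-cluster toy data here is only illustrative), relying on the fact that `KMeans.transform` returns each point's distance to every centroid, so selecting the column of the point's own label gives its distance to its assigned centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1], [2, 2], [3, 3], [4, 4], [5, 50], [6, 6],
                   [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 70]])

kmeans = KMeans(n_clusters=2, n_init=10).fit(points)

# transform() gives a (n_samples, n_clusters) matrix of distances;
# pick, per row, the column of the point's own cluster label
dists = kmeans.transform(points)[np.arange(len(points)), kmeans.labels_]

# same 1.5-sigma rule as in getOutlierLabels
upper = dists.mean() + 1.5 * dists.std()
lower = dists.mean() - 1.5 * dists.std()
outliers = ((dists > upper) | (dists < lower)).astype(int)
print(outliers)
```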

For the first method, however, I also need the centroids and labels resulting from the kmeans.fit call. When I try to write my own scoring function like

my_scorer = make_scorer(my_custom_loss_func, greater_is_better=True)

I guess the function can only take the standard parameters X and y, is that right? If it can take additional parameters, how can I pass them (like the centroids and labels)? Or is there another way to use GridSearchCV for k-means in the context of outlier detection? One more thing: splitting the data into three parts doesn't make sense in my case because the example data is very limited, but I guess I will find a solution for that.

Best regards and thanks for your help!


Solution

  • Callable scorers have the signature (fitted_estimator, X_test, y_test), so you can retrieve the centroids and labels from the fitted KMeans estimator (its cluster_centers_ and labels_ attributes). Just define your callable directly (def my_custom_loss_func(estimator, X, y): ...) instead of using the convenience function make_scorer (which turns a metric with signature (y_true, y_pred) into a scorer).
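A minimal sketch of such a callable scorer, assuming the 1.5-sigma rule from the question (the name `outlier_scorer` and the small dataset are only illustrative; note that GridSearchCV scores each test fold separately, so folds without any true outlier are possible):

```python
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

def outlier_scorer(estimator, X, y):
    # assign each point of the fold to its nearest fitted centroid
    labels = estimator.predict(X)
    centroids = estimator.cluster_centers_
    dists = np.array([distance.euclidean(p, centroids[l])
                      for p, l in zip(X, labels)])
    # 1.5-sigma rule from the question: far-away points are outliers
    limit = dists.mean() + 1.5 * dists.std()
    pred = (dists > limit).astype(int)
    # zero_division=0 keeps folds without any true outlier from erroring
    return f1_score(y, pred, zero_division=0)

points = np.array([[1, 1], [2, 2], [3, 3], [5, 50], [6, 6], [7, 7],
                   [8, 8], [9, 9], [10, 10], [12, 70], [13, 13], [14, 14],
                   [15, 15], [16, 16], [17, 17], [18, 18]], dtype=float)
outlier_gt = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

grid = GridSearchCV(KMeans(n_init=10), {"n_clusters": range(1, 5)},
                    scoring=outlier_scorer, cv=2)
grid.fit(points, outlier_gt)
print(grid.best_params_)
```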