I want to write a thesis about outlier detection and want to compare a few outlier detection methods in a small experiment. According to this [paper](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2022_schmidl_anomaly.pdf), k-means is one candidate that achieved acceptable results. However, k-means was not originally designed as an outlier detection algorithm. K-means has a parameter k (the number of clusters), which can and should be optimised. For this I want to use sklearn's GridSearchCV. I am assuming that I know which data points are outliers.
I wrote a method that calculates the distance of each data point from its centroid; if that distance is more than 1.5 times the standard deviation, I consider the point an outlier and mark it as such.
Later I compare this result with my ground truth and calculate metrics like F1, recall, accuracy, precision and ROC-AUC.
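Roughly, that comparison looks like this (just a sketch, assuming both the predicted and the ground-truth labels are 0/1 arrays and using sklearn.metrics):

from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score, roc_auc_score

def evaluate(ground_truth, predicted):
    # compare predicted 0/1 outlier labels against the known 0/1 ground truth
    return {
        'f1': f1_score(ground_truth, predicted),
        'recall': recall_score(ground_truth, predicted),
        'accuracy': accuracy_score(ground_truth, predicted),
        'precision': precision_score(ground_truth, predicted),
        'rocauc': roc_auc_score(ground_truth, predicted),
    }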
If I call GridSearchCV like this
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans

# toy time series with two obvious outliers (value 50 at t=5 and value 70 at t=12)
points = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 50], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10],
                   [11, 11], [12, 70], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20]])
# ground truth for the 19 rows that remain after the shift/dropna below (1 = outlier)
outlier_gt = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# build (value, previous value) pairs as features
df = pd.DataFrame(data=points, columns=['time', 'value'])
df['value2'] = df['value'].shift(1)
df2 = df.loc[:, ['value', 'value2']]
df2 = df2.dropna()
pointsnew = df2.to_numpy()

clustering = KMeans()
param_grid = {"n_clusters": range(1, 7)}
grid = GridSearchCV(clustering, param_grid=param_grid)
res = grid.fit(pointsnew, outlier_gt)
print(res.cv_results_)
print(grid.best_params_)
the results don't look very good (it tells me k=1 is best, although k=2 or k=4 would be better). I guess the problem is that the fit method returns cluster labels (1, 2, 3, 4, etc.) while my ground truth outlier_gt only contains the labels true or false (0 = false, 1 = true). So the calculated scores don't make a lot of sense, and I should write my own scoring function, which could use the distances to the centroids like in the following method:
from scipy.spatial import distance

def getDistanceToCentroidKMeans(points, centroids, labels):
    # for every point, compute the Euclidean distance to the centroid of its assigned cluster
    res = np.empty(labels.shape)
    for counter, point in enumerate(points):
        centroid = centroids[labels[counter]]
        res[counter] = distance.euclidean(point, centroid)
    return res

# note: kmeans.transform(points) would also give the distances to all centroids at once
def getOutlierLabels(distances):
    # flag a point as outlier if its centroid distance is more than 1.5 standard deviations away from the mean distance
    res = np.empty(distances.shape, dtype=int)
    average = np.mean(distances)
    std = np.std(distances)
    upperLimit = average + 1.5 * std
    lowerLimit = average - 1.5 * std
    for counter, dist in enumerate(distances):
        res[counter] = 1 if dist > upperLimit or dist < lowerLimit else 0
    print('upperLimit: ' + str(upperLimit))
    print('lowerLimit: ' + str(lowerLimit))
    return res
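For reference, outside of GridSearchCV I currently chain these two helpers roughly like this (just a sketch; n_clusters=3 is only an example value):

kmeans = KMeans(n_clusters=3).fit(pointsnew)
distances = getDistanceToCentroidKMeans(pointsnew, kmeans.cluster_centers_, kmeans.labels_)
predictedOutliers = getOutlierLabels(distances)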
For the first method, however, I also need the centroids and labels resulting from the kmeans.fit call. When I try to write my own scoring function like
my_scorer = make_scorer(my_custom_loss_func, greater_is_better=True)
I guess the function can only take the standard parameters X and y, is that right? If it can take additional parameters, how can I pass them in (like centroids and labels)? Or is there another way to use GridSearchCV for k-means in the context of outlier detection? One more thing: splitting the data into 3 parts doesn't make much sense in my case because the data in the example is very limited, but I guess I will find a solution for that.
Best regards and thanks for your help!
Callable scorers have the signature (fitted_estimator, X_test, y_test), so you should be able to retrieve the centroids and labels from the fitted KMeans estimator. Just define your callable directly (def my_custom_loss_func(estimator, X, y): ...) instead of using the convenience function make_scorer (which turns a metric with signature (y_true, y_pred) into a scorer).
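For example, a minimal sketch of such a callable (hypothetical name outlier_f1_scorer; it reuses the two helper functions from the question and scores with F1) could look like this:

from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans

def outlier_f1_scorer(estimator, X, y):
    # estimator is the KMeans instance that GridSearchCV has already fitted on the training fold
    labels = estimator.predict(X)              # cluster assignment of each evaluated point
    centroids = estimator.cluster_centers_     # fitted cluster centers
    distances = getDistanceToCentroidKMeans(X, centroids, labels)
    predicted = getOutlierLabels(distances)
    return f1_score(y, predicted)

grid = GridSearchCV(KMeans(), param_grid={"n_clusters": range(1, 7)}, scoring=outlier_f1_scorer)
grid.fit(pointsnew, outlier_gt)
print(grid.best_params_)

Regarding the small data set: the default 5-fold split is not mandatory; you can pass a smaller number of folds (e.g. cv=2) or an explicit iterable of (train, test) index arrays via the cv parameter of GridSearchCV.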