cluster-analysis k-means silhouette

Cluster Analysis: correcting observations with negative silhouette width


I am trying to find patterns in a dataset (~1000 series) containing time series data with yearly frequency. Some sample data:

         V1     V2     V3     V4     V5     V6     V7     V8     V9    V10    V11    V12    V13    V14    V15    V16    V17    V18
1 1.0000 0.6154 0.0000 0.0769 0.0000 0.0000 0.0000 0.2308 0.6923 0.6923 0.6923 0.6923 0.6923 0.3846 0.3846 0.0769 0.0769 0.0769
2 1.0000 0.8354 0.5274 0.4451 0.4604 0.4634 0.4543 0.2195 0.0976 0.1159 0.0793 0.0000 0.0152 0.0305 0.0305 0.0335 0.0915 0.0152
3 0.9524 0.8571 0.2381 0.1429 0.6667 1.0000 1.0000 0.1905 0.4286 0.3810 0.3810 0.5714 0.0952 0.1905 0.0000 0.0000 0.0952 0.8571
4 0.9200 1.0000 0.6000 0.4000 0.0000 0.4200 0.3600 0.4400 0.4200 0.3200 0.4800 0.6400 0.5200 0.5200 0.5200 0.5400 0.4800 0.7800
5 0.8372 1.0000 0.7209 0.7907 0.6279 0.6047 0.6047 0.6279 0.5349 0.4419 0.4419 0.2791 0.4419 0.2326 0.1860 0.1860 0.1860 0.0000
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.6154 0.6154 0.6154 0.6154 1.0000

Note that the data is normalized, because I want to cluster the time series by shape. I thought a cluster analysis would be an appropriate approach, so I tried to cluster the time series with the following call:

a <- factoextra::eclust(Normalized_df, FUNcluster = "kmeans", nstart = 25, k.max = 5)

However, a few observations have a negative silhouette width. Is there a way to correct these assignments? For example, if sil_width is negative, place the observation in its neighbour cluster. An example can be found below.

cluster neighbor    sil_width
    1       1        3 -0.001258464
    2       1        3 -0.004661913
    3       1        4 -0.010083277
    4       1        4 -0.012569472
    5       1        3 -0.012793575
    6       1        4 -0.013089868
    7       1        5 -0.013346165

The motivation is to correct for these observations, in order to increase the average silhouette width for the clusters.

Any help would be much appreciated!


Solution

  • Moving points with a negative silhouette to another cluster would likely decrease the silhouette of other points in that cluster. It's not obvious how to further improve the results, and a) the best solution may contain negative silhouette values, and b) it might be impossible to find a solution with only positive values. Last but not least, c) it will not be a k-means clustering result anymore: some points will no longer be assigned to the closest mean.

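    The core reason is that the scores within each cluster are interdependent. A point's silhouette depends on its average distance to every member of its own cluster and of its neighbour cluster, so moving one point to another cluster changes the scores of all points in both clusters.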
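
To see this on your own data, here is a minimal sketch that moves every negative-silhouette point to its neighbour cluster and then recomputes the silhouette. It uses stats::kmeans and cluster::silhouette directly rather than eclust, and the matrix x is random synthetic data standing in for Normalized_df (an assumption, as are the names sil_before, sil_after and new_cluster). It only lets you compare before and after; it is not a recommended fix.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Synthetic stand-in for Normalized_df (assumption): 200 series of length 18
x <- matrix(runif(200 * 18), nrow = 200)

km <- kmeans(x, centers = 5, nstart = 25)
d  <- dist(x)

# Rows of the silhouette matrix are in the original observation order
sil_before <- silhouette(km$cluster, d)

# Naive "correction": send each negative-width point to its neighbour cluster
new_cluster <- km$cluster
neg <- sil_before[, "sil_width"] < 0
new_cluster[neg] <- sil_before[neg, "neighbor"]

# All widths must be recomputed, because cluster memberships have changed
sil_after <- silhouette(new_cluster, d)

cat("points moved:          ", sum(neg), "\n")
cat("avg width before:      ", mean(sil_before[, "sil_width"]), "\n")
cat("avg width after:       ", mean(sil_after[, "sil_width"]), "\n")
cat("negative widths after: ", sum(sil_after[, "sil_width"] < 0), "\n")
```

Compare the two averages and check whether any negative widths remain after the move. Points that were previously fine can turn negative, and the moved points are no longer assigned to their nearest k-means centre, so the result is not a k-means partition anymore.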