python, loops, cluster-analysis, k-means, dtw

How to find the optimal k for each group in data with k-means clustering


I have a dataset that has 10 different groups and sales for 3 weeks in a year. I am trying to run a clustering algorithm that clusters the items within each group, where the number of clusters depends on how many items are present in that group. Basically, I want to treat each group differently.

I tried a manual approach and set the number of clusters for each group relative to the group with the highest number of items, but I want the code to find the optimal k for k-means for each group on its own. I am not sure how to determine the best k for each of the groups.

Here is the data:

items_scaled = 
PROD    LOB_LABEL    1         2           3    
100001  Books   0.022556    0.020326    0.020556    
100002  Books   0.023756    0.080306    0.020656
100003  Candles 0.022966    0.020178    0.020291    
100004  Shoes   0.021067    0.020485    0.019420    
100005  Candles 0.020403    0.021067    0.020556
100006  Shoes   0.023634    0.026219    0.029357
100007  Books   0.022472    0.017218    0.016454
100008  Pens    0.023670    0.027971    0.029763
100009  Pens    0.037894    0.026664    0.031777
100010  Shoes   0.015929    0.015205    0.015446    
....

Here is my attempt at finding the best k, but the runtime is really high. At this rate, it might take a day or more to run. I am working with 3500 rows of data. Is there a better/more optimal way to achieve my result?

from tqdm import tqdm
from tslearn.clustering import TimeSeriesKMeans, silhouette_score

for lob in item_df['LOB_LABEL'].unique():
    items = item_df[item_df['LOB_LABEL'] == lob]
    items_scaled = items.iloc[:, 2:54]              # weekly sales columns only
    silhouette = []                                 # reset the scores for each LOB
    K = range(2, len(items))
    for k in tqdm(K):
        kmeanModel = TimeSeriesKMeans(n_clusters=k, metric="dtw",
                                      n_jobs=6, max_iter=10, n_init=5)
        kmeanModel.fit(items_scaled)
        silhouette.append(silhouette_score(items_scaled, kmeanModel.labels_,
                                           metric="dtw"))
        print(lob, max(silhouette), k)


Solution

  • The optimal number of clusters is based either on your presumptions, e.g. equal to the highest number of items, or you can determine it empirically. To do that, you run the algorithm for a range of values of k and calculate the error of each clustering, for example the MSE between all members of a cluster and the cluster center. Then you have to make a trade-off between the acceptable error (which most likely decreases with the number of clusters) and whether a larger number of clusters still makes sense for the task at hand (see the first sketch below).

    To reduce the time complexity of the empirical approach, you can change three variables: the maximum for K, the number of iterations, and the number of samples used in the parameter sweep. If you want the most optimal solution, you are best served by taking the day to run this. However, if you are hard pressed for time or suspect you will need to rerun this at some point, I advise you to use a subset of your data for the hyperparameter search (the second sketch below shows this).

    More practically speaking, I think you will find k << len(items), so your search range can probably be greatly reduced. Combine that with a subset of the data points and you should save a lot of time.
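
    A minimal sketch of that error sweep, assuming the same tslearn TimeSeriesKMeans setup as in the question. sweep_k and pick_elbow are hypothetical helper names, and the model's inertia_ (the sum of DTW distances from each series to its cluster center) stands in for the per-cluster MSE described above:

    from tslearn.clustering import TimeSeriesKMeans

    def sweep_k(X, k_max=10, metric="dtw"):
        """Fit k-means for k = 2..k_max and return the clustering error per k.

        X is a 2-D array of time series (rows = items, columns = weeks).
        The error is tslearn's inertia_, i.e. the sum of DTW distances
        between every series and its assigned cluster center.
        """
        errors = {}
        for k in range(2, min(k_max, len(X) - 1) + 1):
            model = TimeSeriesKMeans(n_clusters=k, metric=metric,
                                     max_iter=10, n_init=2, random_state=0)
            model.fit(X)
            errors[k] = model.inertia_
        return errors

    def pick_elbow(errors):
        """Crude elbow heuristic: pick the k whose error shows the largest
        relative drop compared to k - 1."""
        ks = sorted(errors)
        if len(ks) < 2:
            return ks[0] if ks else 2   # degenerate case: too few points to sweep
        drops = {ks[i]: (errors[ks[i - 1]] - errors[ks[i]]) / errors[ks[i - 1]]
                 for i in range(1, len(ks))}
        return max(drops, key=drops.get)

    The relative-drop rule is only one way to read the elbow; plotting the errors and eyeballing the bend, or reusing the silhouette score from the question, works just as well.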
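
    And a sketch of the time-saving version, reusing the hypothetical sweep_k and pick_elbow helpers above; K_MAX and SAMPLE_FRAC are assumed values to tune, not anything prescribed by tslearn:

    from tslearn.clustering import TimeSeriesKMeans

    K_MAX = 10          # assumed cap on the search range, since k << len(items)
    SAMPLE_FRAC = 0.3   # assumed fraction of each group used for the k search

    best_k = {}
    for lob in item_df['LOB_LABEL'].unique():
        items = item_df[item_df['LOB_LABEL'] == lob]
        # Search for k on a random subset of the group only
        # (small groups may need frac=1.0).
        sample = items.sample(frac=SAMPLE_FRAC, random_state=0)
        errors = sweep_k(sample.iloc[:, 2:54].to_numpy(), k_max=K_MAX)
        best_k[lob] = pick_elbow(errors)

    # Refit once per group with the chosen k, this time on all of its rows.
    final_models = {}
    for lob, k in best_k.items():
        X_full = item_df.loc[item_df['LOB_LABEL'] == lob].iloc[:, 2:54].to_numpy()
        final_models[lob] = TimeSeriesKMeans(n_clusters=k, metric="dtw",
                                             n_jobs=6, n_init=5).fit(X_full)

    Capping K_MAX and subsampling are the two biggest levers here; the final fit on the full group is still DTW k-means, so the clustering quality is preserved where it matters.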