Tags: python, pandas, cluster-analysis, k-means, centroid

K-Means algorithm: centroids are not placed in the clusters


I have a problem. I want to cluster my dataset, but my centroids end up outside the clusters instead of inside them. I have already read "Python k-mean, centroids are placed outside of the clusters" about this.

However, I do not know what could be the reason. How can I cluster correctly?

You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset

import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt

df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape

features_clustering = ['review_scores_accuracy',
 'distance_to_center',
 'bedrooms',
 'review_scores_location',
 'review_scores_value',
 'number_of_reviews',
 'beds',
 'review_scores_communication',
 'accommodates',
 'review_scores_checkin',
 'amenities_count',
 'review_scores_rating',
 'reviews_per_month',
 'corrected_price']

df_cluster = df[features_clustering].copy()
X = df_cluster.copy()

model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters

fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
                            label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()

model.cluster_centers_

inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)

print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')

[OUT]

inertia 4490.076
silhouette 0.156


Solution

  • The answer to your main question: the cluster centers are not outside of your clusters.

    1: You are clustering over the 14 features listed in features_clustering.

    2: You are viewing the clusters in a two-dimensional space. The data points are plotted against amenities_count and corrected_price, but the cluster centers are plotted with x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1]. Those are the first two features in the list (review_scores_accuracy and distance_to_center), not the features used for the data.

    For these reasons the plot gives strange results: the centroid markers and the data points are drawn on different axes, so their positions cannot be compared.

    The bottom line is that a two-dimensional plot shows only two of the 14 dimensions, so clusters that are separated in the full feature space can still overlap in the plot.

    To see point 2 more clearly, change the line that plots the cluster centers to

    sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)
    

    so that the cluster centers are plotted against the same features as the data (in features_clustering, index 10 is amenities_count and index 13 is corrected_price).
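Hard-coded column positions are easy to get wrong, so one option is to look the positions up by feature name. The following is only a minimal sketch: it uses random synthetic data with three of the column names from the question (an assumption, so it runs without the real dataset), and the variable names x_idx, y_idx, centroid_x and centroid_y are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real dataset (assumption: random values,
# only three of the original column names are reused here).
rng = np.random.default_rng(0)
features = ["distance_to_center", "amenities_count", "corrected_price"]
X = pd.DataFrame(rng.random((200, 3)), columns=features)

model = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X)

# Look up the column positions by name instead of hard-coding them.
# The columns of cluster_centers_ follow the column order of X.
x_idx = features.index("amenities_count")
y_idx = features.index("corrected_price")

# Centroid coordinates for exactly the two features being plotted.
centroid_x = model.cluster_centers_[:, x_idx]
centroid_y = model.cluster_centers_[:, y_idx]

print(x_idx, y_idx)
print(centroid_x, centroid_y)
```

With the real data, the same lookup on the full features_clustering list would give the indices used above, and centroid_x and centroid_y could be passed to sns.scatterplot as x and y.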


    The linked SO answer about cluster centers lying outside the cluster data deals with a different situation: the data was scaled to between 0 and 1 before clustering, and the cluster centers were then not scaled back up when plotted with the original data. That is not the issue here.
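For comparison, the scaling situation described in that answer can be sketched as follows. This is an illustration under assumptions, not the code from the linked answer: the data is synthetic, and MinMaxScaler is assumed as the way of scaling to the 0 to 1 range.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: two features on very different scales (hypothetical values).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 50, 300), rng.uniform(20, 500, 300)])

# Scale every feature to [0, 1] before clustering.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_scaled)

# The centroids live in the scaled space ...
centers_scaled = model.cluster_centers_

# ... so they have to be mapped back before plotting them with the raw data.
centers_original = scaler.inverse_transform(centers_scaled)

print(centers_scaled)
print(centers_original)
```

Plotting centers_scaled on top of the unscaled X would put every centroid near the origin, which is the effect described in the linked answer. Plotting centers_original instead keeps centroids and data in the same units.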