I want to cluster my dataset, but my centroids appear outside the clusters instead of inside them. I have already read the question "Python k-mean, centroids are placed outside of the clusters" about this.
However, I do not know what could be the reason. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
'distance_to_center',
'bedrooms',
'review_scores_location',
'review_scores_value',
'number_of_reviews',
'beds',
'review_scores_communication',
'accommodates',
'review_scores_checkin',
'amenities_count',
'review_scores_rating',
'reviews_per_month',
'corrected_price']
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1: You are clustering over the 14 features listed in features_clustering.
2: You are viewing the clusters in a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, but plotting the cluster centers with x=model.cluster_centers_[:,0] and y=model.cluster_centers_[:,1]. Those two columns are review_scores_accuracy and distance_to_center, so the centers are not drawn against the same features as the data.
For these reasons the plot looks strange: the points and the centroids are drawn against different features, so their relative positions don't mean anything.
The bottom line is that you cannot faithfully view a 14-dimensional clustering in two dimensions.
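If a single 2-D picture is still wanted, one common workaround (not something the question's code does) is to project both the data and the centers with the same linear projection, for example PCA. The sketch below is only an illustration under stated assumptions: X is a random stand-in for the 14-feature matrix, not the real dataset, and the projection loses information, so clusters may still overlap in the picture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for the 14-feature matrix from the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 14))

model = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X)

# Fit the projection on the data, then apply the SAME projection
# to the cluster centers so both live in the same 2-D space.
pca = PCA(n_components=2).fit(X)
points_2d = pca.transform(X)
centers_2d = pca.transform(model.cluster_centers_)

print(points_2d.shape, centers_2d.shape)
```

points_2d and centers_2d can then be passed to two scatterplot calls, using model.labels_ as the hue for the points.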
To show point 2 more clearly, change the line that plots the cluster centers to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)
so that the cluster centers are plotted against the same features as the data (columns 10 and 13 are amenities_count and corrected_price).
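Hard-coded column numbers are easy to get wrong, so it may be safer to look the columns up by name. The following is a minimal sketch, assuming the model was fitted on a DataFrame whose columns are the clustering features; the three-column random frame here is a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in: three of the feature names, random values.
rng = np.random.default_rng(0)
features = ["review_scores_accuracy", "amenities_count", "corrected_price"]
X = pd.DataFrame(rng.random((200, 3)), columns=features)

model = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X)

# Label the centers with the same column names as the training data,
# so they can be selected by feature name instead of by position.
centers = pd.DataFrame(model.cluster_centers_, columns=X.columns)

# Positional indices, if the raw array is preferred.
x_idx = X.columns.get_loc("amenities_count")
y_idx = X.columns.get_loc("corrected_price")

print(centers[["amenities_count", "corrected_price"]])
```

With this, the centroid plot can use x=centers["amenities_count"] and y=centers["corrected_price"], which always matches the axes of the data plot.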
The linked SO answer about cluster centers lying outside the cluster data concerns a different problem: scaling the data to between 0 and 1 before clustering, and then not scaling the cluster centers back up when plotting them with the original data. That is not the issue here.
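For completeness, the scaling situation from that other answer can be sketched as follows. This assumes MinMaxScaler is used for the 0 to 1 scaling, and the two-feature random data is hypothetical; the point is only that centers found in scaled space need inverse_transform before being drawn over unscaled data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 50, 300), rng.uniform(20, 500, 300)])

# Cluster in scaled space, where every feature lies in [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
model = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X_scaled)

# The centers are in scaled units; map them back to the original
# units before plotting them together with the unscaled data.
centers_scaled = model.cluster_centers_
centers_original = scaler.inverse_transform(centers_scaled)
```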