I have a DataFrame made up of 0s and 1s in each row. The idea is to compare and cluster all the rows in the df into a specific number of clusters (in this case, let's say 5).
What I need to get is the row indexes for each of the 5 clusters (or a .groupby by cluster that keeps the original row index).
The df looks like this:
0 1 2 3 4 5 6 7 8 9 ... 528 529 530 531 532 533 534 535 536 537
0 0 0 0 0 0 0 0 1 1 1 ... 0 1 1 1 0 0 0 1 0 1
1 0 0 0 0 0 0 0 1 1 1 ... 0 1 1 1 0 0 0 1 0 1
2 0 0 0 0 0 0 0 1 1 1 ... 0 1 1 1 0 0 0 1 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
137 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
138 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
139 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
140 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
141 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
I found another answer that provides this solution here: Kmeans Cluster for each group in pandas dataframe and assign clusters
def cluster(X):
    k_means = KMeans(n_clusters=5).fit(X)
    return X.groupby(k_means.labels_)\
        .transform('mean').sum(1)\
        .rank(method='dense').sub(1)\
        .astype(int).to_frame()
And the result I'm getting is:
0
0 1
1 1
2 1
3 0
4 0
... ...
137 3
138 1
139 3
140 3
141 3
But to be fair, I don't entirely understand what it does, or whether the result I'm getting here is the cluster number for each row.
I'm not entirely sure what your example piece does either, but for your use case something like this would work. First, get the cluster labels:
from sklearn.cluster import KMeans
df["cluster"] = KMeans(n_clusters=5).fit(df).labels_
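For what it's worth, here is a small check of how the snippet from the question seems to relate to labels_. It is only a sketch on made-up random data (the 60 x 12 frame, random_state=0 and n_init=10 are my own illustrative choices, not anything from your setup). As far as I can tell, the groupby/transform/rank chain replaces each row with its cluster's column means, sums them, and dense-ranks that sum, so it returns the same grouping as labels_ but renumbered by that sum (two clusters that tie on the sum would be merged).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for the real frame of 0s and 1s.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(60, 12)))

# random_state and n_init are illustrative; fixing them makes reruns repeatable.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Raw k-means cluster number per row, aligned with the original row index.
labels = pd.Series(km.labels_, index=X.index, name="cluster")

# The expression from the question, applied to the same labels.
ranked = (
    X.groupby(km.labels_)
    .transform("mean")      # each row -> column means of its cluster
    .sum(axis=1)            # one number per row, identical within a cluster
    .rank(method="dense")   # order clusters by that number
    .sub(1)
    .astype(int)
    .rename("ranked")
)

# How many distinct ranked values each k-means label maps to.
ranks_per_label = ranked.groupby(labels).nunique()
print(pd.crosstab(labels, ranked))
```

If every row of the crosstab has a single non-zero cell, the ranked output is just a relabeling, and using labels_ directly is the simpler way to get the cluster number.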
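And then, if you need to do something with the indices of each cluster, you can, for example, get them as a dict with groupby("cluster").indices. Note that .indices holds integer row positions; with a default 0..n index like yours these are the same as the row labels, and .groups gives the labels directly otherwise: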
>>> df.groupby("cluster").indices
{0: array([0, 1]), 1: array([2, 3]), ...}
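If your real index is not the default 0..n range, the positions from .indices have to be mapped back to index labels. A minimal sketch, assuming made-up data and hypothetical row names ("row_0", "row_1", ...), with the labels kept in a separate array rather than as a column of df:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical row names, so that positions and index labels differ.
row_names = [f"row_{i}" for i in range(40)]
df = pd.DataFrame(rng.integers(0, 2, size=(40, 8)), index=row_names)

# Illustrative settings; labels stay outside df so they are never used as a feature.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit(df).labels_

grouped = df.groupby(labels)
positions = grouped.indices  # cluster id -> array of integer row positions
by_label = grouped.groups    # cluster id -> Index of original row labels

for cluster_id, pos in positions.items():
    # df.index[pos] turns positions back into the original row labels.
    print(cluster_id, list(df.index[pos]))
```

Keeping the labels out of df also means that calling fit(df) again does not cluster on a previous run's "cluster" column.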