I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})
df['cluster'].value_counts() gives the number of occurrences of each value in the cluster column.
Is it possible to retain only the rows whose cluster value has the maximum number of occurrences?
Clusters 1 and 2 have the same (maximum) number of occurrences, so the expected output keeps all the rows for clusters 1 and 2.
You can get the count of each cluster value with df['cluster'].value_counts(), then use isin to keep only the rows whose cluster value has the maximum count:
c = df['cluster'].value_counts()
out = df[df['cluster'].isin(c[c.eq(c.max())].index)]
print(out)
      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2
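If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch: the function name keep_most_frequent is made up for illustration, and it swaps isin for Series.map, which attaches each row's cluster count directly to that row. Ties for the maximum are all kept, as in the answer above.

```python
import pandas as pd


def keep_most_frequent(df, column):
    """Return the rows whose value in `column` is (one of) the most frequent.

    Hypothetical helper name, not part of pandas.
    """
    counts = df[column].value_counts()   # occurrences of each distinct value
    per_row = df[column].map(counts)     # count of each row's own value
    return df[per_row.eq(counts.max())]  # keep rows tied for the maximum


df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})
out = keep_most_frequent(df, 'cluster')
print(out)
```

With the sample data this keeps rows 0, 1, 3 and 4, and the original index is preserved. Call reset_index(drop=True) on the result if you want a fresh 0..n-1 index.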