python pandas dataframe cluster-analysis

How can I get the first three max values from each row in a Pandas dataframe?


The dataframe looks like this:

Cluster  Genre 1  Genre 2  Genre 3  Genre 4  Genre 5
1        10       31       5        3        23
2        53       12       6        9        7
3        44       73       1        9        13

As output, I want something like this, so I can see which genres are dominant in each cluster:

Cluster  1st      2nd      3rd
1        Genre 2  Genre 5  Genre 1
2        Genre 1  Genre 2  Genre 4
3        Genre 2  Genre 1  Genre 5

I want to show the top 3 "genres" from each cluster in a graph, but I have no idea how to do this per row rather than per column. Is anyone here familiar with this?


Solution

  • You can apply numpy.argsort to the negated df.values with axis=1, so that each row is ordered from largest to smallest, keep the first three positions, and use df.columns to map those positions back to column names:

    import pandas as pd
    import numpy as np
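    df = df.set_index('Cluster')
    # Negate the values so argsort orders each row from largest to smallest,
    # then keep the first three column positions per row.
    top3 = np.argsort(-df.values, axis=1)[:, :3]
    # Index a NumPy array of the column names: pandas 2.x no longer allows
    # indexing df.columns directly with a 2-D array.
    res = pd.DataFrame(df.columns.to_numpy()[top3],
                       columns=['1st', '2nd', '3rd'])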
    print(res)
    

    Output:

            1st       2nd       3rd
    0   Genre 2   Genre 5   Genre 1
    1   Genre 1   Genre 2   Genre 4
    2   Genre 2   Genre 1   Genre 5
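
    The snippet above assumes df already exists and labels the result rows 0, 1, 2 rather than by cluster. Below is a minimal self-contained sketch of the same argsort approach. It builds the dataframe from the numbers in the question and passes index=df.index, which is an addition to the original answer, so the rows keep their Cluster labels:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Cluster": [1, 2, 3],
        "Genre 1": [10, 53, 44],
        "Genre 2": [31, 12, 73],
        "Genre 3": [5, 6, 1],
        "Genre 4": [3, 9, 9],
        "Genre 5": [23, 7, 13],
    }
).set_index("Cluster")

# Column positions of the three largest values in each row, largest first.
top3 = np.argsort(-df.to_numpy(), axis=1)[:, :3]

# Map positions back to column names; index=df.index keeps the cluster labels
# (this is an addition, not part of the original answer).
res = pd.DataFrame(
    df.columns.to_numpy()[top3],
    index=df.index,
    columns=["1st", "2nd", "3rd"],
)
print(res)
```

    With this data there are no ties among the top three values of any row. If ties can occur, the order among equal values depends on the sort algorithm argsort uses.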