pythonpandas

How do I compute group means in a pandas DataFrame when grouped by ranking?


What I'd like to do is create a DataFrame with dates as indices, ranks as columns, and values pulled from scores. Since the ranking method guarantees uniqueness, I suppose I can melt/dropna/pivot to get what I want. This approach feels kind of brittle, though, and I'm not sure what I'd do if I had non-unique rank values.

Here's the problem setup:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']

random_scores = rng.random(len(dates) * len(contestants))
random_scores = random_scores.reshape((len(dates), len(contestants)))
scores = pd.DataFrame(random_scores, dates, contestants)
scores.index.name = 'DATE'
scores.columns.name = 'CONTESTANT'
ranks = scores.rank(axis=1, method='first', ascending=False)
top3_ranks = ranks.where(ranks <= 3)

This yields the following output:

>>> scores
CONTESTANT  Alligator     Beryl  Chupacabra  Dandelion  Eggplant  Feldspar
DATE
2024-08-01   0.773956  0.438878    0.858598   0.697368  0.094177  0.975622
2024-08-02   0.761140  0.786064    0.128114   0.450386  0.370798  0.926765
2024-08-03   0.643865  0.822762    0.443414   0.227239  0.554585  0.063817
2024-08-04   0.827631  0.631664    0.758088   0.354526  0.970698  0.893121
2024-08-05   0.778383  0.194639    0.466721   0.043804  0.154289  0.683049
2024-08-06   0.744762  0.967510    0.325825   0.370460  0.469556  0.189471
2024-08-07   0.129922  0.475705    0.226909   0.669814  0.437152  0.832678

>>> top3_ranks
CONTESTANT  Alligator  Beryl  Chupacabra  Dandelion  Eggplant  Feldspar
DATE
2024-08-01        3.0    NaN         2.0        NaN       NaN       1.0
2024-08-02        3.0    2.0         NaN        NaN       NaN       1.0
2024-08-03        2.0    1.0         NaN        NaN       3.0       NaN
2024-08-04        3.0    NaN         NaN        NaN       1.0       2.0
2024-08-05        1.0    NaN         3.0        NaN       NaN       2.0
2024-08-06        2.0    1.0         NaN        NaN       3.0       NaN
2024-08-07        NaN    3.0         NaN        2.0       NaN       1.0

Here's what I tried:

melted_scores = scores.T.melt(value_name='SCORE')
melted_ranks = top3_ranks.T.melt(value_name='RANK')
melted_scores['RANK'] = melted_ranks['RANK']
desired_output = melted_scores.dropna().pivot(columns='RANK', index='DATE', values='SCORE')
>>> desired_output
RANK             1.0       2.0       3.0
DATE
2024-08-01  0.975622  0.858598  0.773956
2024-08-02  0.926765  0.786064  0.761140
2024-08-03  0.822762  0.643865  0.554585
2024-08-04  0.970698  0.893121  0.827631
2024-08-05  0.778383  0.683049  0.466721
2024-08-06  0.967510  0.744762  0.469556
2024-08-07  0.832678  0.669814  0.475705

Is there a better way to do this that can also handle non-unique rank values?


Solution

  • Pivotting is a correct approach. You could avoid the double melt by using concat and stack before pivot:

    (pd.concat({'score': scores, 'rank': top3_ranks}, axis=1)
       .stack().dropna(subset=['rank']).droplevel('CONTESTANT')
       .pivot(columns='rank', values='score')
    )
    

    Output:

    rank             1.0       2.0       3.0
    DATE                                    
    2024-08-01  0.975622  0.858598  0.773956
    2024-08-02  0.926765  0.786064  0.761140
    2024-08-03  0.822762  0.643865  0.554585
    2024-08-04  0.970698  0.893121  0.827631
    2024-08-05  0.778383  0.683049  0.466721
    2024-08-06  0.967510  0.744762  0.469556
    2024-08-07  0.832678  0.669814  0.475705
    

    If you just want to take the top 3 scores, you could use :

    a = scores.to_numpy()
    a.sort(axis=1)
    N = 3
    out = pd.DataFrame(a[:, :-N-1:-1], index=scores.index,
                       columns=pd.Index(range(1, N+1), name='RANK'))
    

    Output:

    RANK               1         2         3
    DATE                                    
    2024-08-01  0.975622  0.858598  0.773956
    2024-08-02  0.926765  0.786064  0.761140
    2024-08-03  0.822762  0.643865  0.554585
    2024-08-04  0.970698  0.893121  0.827631
    2024-08-05  0.778383  0.683049  0.466721
    2024-08-06  0.967510  0.744762  0.469556
    2024-08-07  0.832678  0.669814  0.475705