What I'd like to do is create a DataFrame with dates
as indices, ranks
as columns, and values pulled from scores
. Since the ranking method guarantees uniqueness, I suppose I can melt/dropna/pivot to get what I want.
This approach feels kind of brittle, though, and I'm not sure what I'd do if I had non-unique rank values.
Here's the problem setup:
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
random_scores = rng.random(len(dates) * len(contestants))
random_scores = random_scores.reshape((len(dates), len(contestants)))
scores = pd.DataFrame(random_scores, dates, contestants)
scores.index.name = 'DATE'
scores.columns.name = 'CONTESTANT'
ranks = scores.rank(axis=1, method='first', ascending=False)
top3_ranks = ranks.where(ranks <= 3)
This yields the following output:
>>> scores
CONTESTANT Alligator Beryl Chupacabra Dandelion Eggplant Feldspar
DATE
2024-08-01 0.773956 0.438878 0.858598 0.697368 0.094177 0.975622
2024-08-02 0.761140 0.786064 0.128114 0.450386 0.370798 0.926765
2024-08-03 0.643865 0.822762 0.443414 0.227239 0.554585 0.063817
2024-08-04 0.827631 0.631664 0.758088 0.354526 0.970698 0.893121
2024-08-05 0.778383 0.194639 0.466721 0.043804 0.154289 0.683049
2024-08-06 0.744762 0.967510 0.325825 0.370460 0.469556 0.189471
2024-08-07 0.129922 0.475705 0.226909 0.669814 0.437152 0.832678
>>> top3_ranks
CONTESTANT Alligator Beryl Chupacabra Dandelion Eggplant Feldspar
DATE
2024-08-01 3.0 NaN 2.0 NaN NaN 1.0
2024-08-02 3.0 2.0 NaN NaN NaN 1.0
2024-08-03 2.0 1.0 NaN NaN 3.0 NaN
2024-08-04 3.0 NaN NaN NaN 1.0 2.0
2024-08-05 1.0 NaN 3.0 NaN NaN 2.0
2024-08-06 2.0 1.0 NaN NaN 3.0 NaN
2024-08-07 NaN 3.0 NaN 2.0 NaN 1.0
Here's what I tried:
melted_scores = scores.T.melt(value_name='SCORE')
melted_ranks = top3_ranks.T.melt(value_name='RANK')
melted_scores['RANK'] = melted_ranks['RANK']
desired_output = melted_scores.dropna().pivot(columns='RANK', index='DATE', values='SCORE')
>>> desired_output
RANK 1.0 2.0 3.0
DATE
2024-08-01 0.975622 0.858598 0.773956
2024-08-02 0.926765 0.786064 0.761140
2024-08-03 0.822762 0.643865 0.554585
2024-08-04 0.970698 0.893121 0.827631
2024-08-05 0.778383 0.683049 0.466721
2024-08-06 0.967510 0.744762 0.469556
2024-08-07 0.832678 0.669814 0.475705
Is there a better way to do this that can also handle non-unique rank values?
Pivotting is a correct approach. You could avoid the double melt
by using concat
and stack
before pivot
:
(pd.concat({'score': scores, 'rank': top3_ranks}, axis=1)
.stack().dropna(subset=['rank']).droplevel('CONTESTANT')
.pivot(columns='rank', values='score')
)
Output:
rank 1.0 2.0 3.0
DATE
2024-08-01 0.975622 0.858598 0.773956
2024-08-02 0.926765 0.786064 0.761140
2024-08-03 0.822762 0.643865 0.554585
2024-08-04 0.970698 0.893121 0.827631
2024-08-05 0.778383 0.683049 0.466721
2024-08-06 0.967510 0.744762 0.469556
2024-08-07 0.832678 0.669814 0.475705
If you just want to take the top 3 scores, you could use numpy:
a = scores.to_numpy()
a.sort(axis=1)
N = 3
out = pd.DataFrame(a[:, :-N-1:-1], index=scores.index,
columns=pd.Index(range(1, N+1), name='RANK'))
Output:
RANK 1 2 3
DATE
2024-08-01 0.975622 0.858598 0.773956
2024-08-02 0.926765 0.786064 0.761140
2024-08-03 0.822762 0.643865 0.554585
2024-08-04 0.970698 0.893121 0.827631
2024-08-05 0.778383 0.683049 0.466721
2024-08-06 0.967510 0.744762 0.469556
2024-08-07 0.832678 0.669814 0.475705