Say I have a MultiIndex
by MultiIndex
DataFrame
similar to the one generated here (in the real use case the list of races is dynamic and not known ahead of time):
import random
import pandas as pd
random.seed(1)
data_frame_rows = pd.MultiIndex.from_arrays([[], [], []], names=("car", "engine", "wheels"))
data_frame_columns = pd.MultiIndex.from_arrays([[], [], []], names=("group", "subgroup", "details"))
data_frame = pd.DataFrame(index=data_frame_rows, columns=data_frame_columns)
for car in ("mustang", "corvette", "civic"):
for engine in ("normal", "supercharged"):
for wheels in ("normal", "wide"):
data_frame.loc[(car, engine, wheels), ("cost", "", "money ($)")] = int(random.random() * 100)
data_frame.loc[(car, engine, wheels), ("cost", "", "maintenance (minutes)")] = int(random.random() * 60)
for race in ("f1", "indy", "lemans"):
percent_win = random.random()
recommended = percent_win >= 0.8
data_frame.loc[(car, engine, wheels), ("race", race, "win %")] = percent_win
data_frame.loc[(car, engine, wheels), ("race", race, "recommended")] = recommended
Which then will look something like:
group cost race
subgroup f1 indy lemans
details money ($) maintenance (minutes) win % recommended win % recommended win % recommended
car engine wheels
mustang normal normal 13.0 50.0 0.763775 False 0.255069 False 0.495435 False
wide 44.0 39.0 0.788723 False 0.093860 False 0.028347 False
supercharged normal 83.0 25.0 0.762280 False 0.002106 False 0.445387 False
wide 72.0 13.0 0.945271 True 0.901427 True 0.030590 False
corvette normal normal 2.0 32.0 0.939149 True 0.381204 False 0.216599 False
wide 42.0 1.0 0.221692 False 0.437888 False 0.495812 False
supercharged normal 23.0 13.0 0.218781 False 0.459603 False 0.289782 False
wide 2.0 50.0 0.556454 False 0.642294 False 0.185906 False
civic normal normal 99.0 51.0 0.120890 False 0.332695 False 0.721484 False
wide 71.0 56.0 0.422107 False 0.830036 True 0.670306 False
supercharged normal 30.0 35.0 0.882479 True 0.846197 True 0.505284 False
wide 58.0 2.0 0.242740 False 0.797404 False 0.414314 False
I now want to find all rows where the particular car configuration (engine and wheel combination) is the "best" configuration for that car. For example, in this case the civic
has two configurations that are recommended, but the civic
with a supercharged
engine
and normal
wheels
has the highest chance of winning a race (88% in the f1
race). All other cars/configurations that aren't recommended for any race, or have a lower chance of winning any race, I want filtered out. The mustang
and the corvette
have one configuration each which is recommended for any race, so these are the configurations I would choose for those two cars.
So the final output would be each car listed at most once, with the best configuration. If a car has no recommended configurations then I want it out completely.
I've read this multiple times and for the life of me I can't figure it out.
As a starting point to just get the recommended rows I tried something like:
data_frame[(data_frame.loc[:,idx["race",:,"recommended"]]==True)]
But that doesn't seem to filter the rows, just sets things to either NaN or True
group cost race
subgroup f1 indy lemans
details money ($) maintenance (minutes) win % recommended win % recommended win % recommended
car engine wheels
mustang normal normal NaN NaN NaN NaN NaN NaN NaN NaN
wide NaN NaN NaN NaN NaN NaN NaN NaN
supercharged normal NaN NaN NaN NaN NaN NaN NaN NaN
wide NaN NaN NaN True NaN True NaN NaN
corvette normal normal NaN NaN NaN True NaN NaN NaN NaN
wide NaN NaN NaN NaN NaN NaN NaN NaN
supercharged normal NaN NaN NaN NaN NaN NaN NaN NaN
wide NaN NaN NaN NaN NaN NaN NaN NaN
civic normal normal NaN NaN NaN NaN NaN NaN NaN NaN
wide NaN NaN NaN NaN NaN True NaN NaN
supercharged normal NaN NaN NaN True NaN True NaN NaN
wide NaN NaN NaN NaN NaN NaN NaN NaN
Selecting multi-index columns can be complicated and has to be be precise. The code below gives the 4 rows with True
in any recommended
column as required:
selection = data_frame[data_frame.loc[:, ('race', (slice(None)), 'recommended')].any(axis = 1)]
print(selection)
Essentially this is selecting only columns with recommended
then combining the True
values from these into a mask using any()
and finally using this to select the needed rows from the full DF. == True
is not needed as assumed but would be accepted. (slice(None))
is a convenient way to select all columns at Level1 within race
at Level 0; alternatively you could replace this with (['f1', 'indy', 'lemans'])
but can't use :
in this position. The output is as expected:
From here you can further select as required.
The confusion with all the Nan values in your almost-there code was essentially due to not combining several columns to produce a single mask column.