I am working with Awkward Arrays and dumping the information into a pandas DataFrame with a MultiIndex:
>>> import awkward as ak
>>> import pandas as pd
>>> ak_arr = ak.Array([
... {
... 'jet_pt': [2.33e+05, 1.1e+04, 1.47e+05, 1.33e+04, 1.73e+05, 1.07e+04],
... 'jet_num': 6,
... 'bb_dR': [0.83e-01, 0.56e-01, 0.98e-01, 0.32e-01, 0.21e-01, 0.66e-01],
... 'hh_m': 3.25e+05
... },
... {
... 'jet_pt': [1.48e+05, 2.06e+04, 9.93e+04, 1.29e+04],
... 'jet_num': 4,
... 'bb_dR': [0.12e-1, 0.32e-01, 0.45e-01, 0.76e-01, 0.33e-01, 0.54e-01],
... 'hh_m': 2.87e+05
... }
... ])
>>> ak_arr
<Array [{jet_pt: [...], ...}, {...}] type='2 * {jet_pt: var * float64, jet_...'>
>>> df = ak.to_dataframe(ak_arr, how='outer')
>>> df
                  jet_pt  jet_num  bb_dR      hh_m
entry subentry
0     0         233000.0        6  0.083  325000.0
      1          11000.0        6  0.056  325000.0
      2         147000.0        6  0.098  325000.0
      3          13300.0        6  0.032  325000.0
      4         173000.0        6  0.021  325000.0
      5          10700.0        6  0.066  325000.0
1     0         148000.0        4  0.012  287000.0
      1          20600.0        4  0.032  287000.0
      2          99300.0        4  0.045  287000.0
      3          12900.0        4  0.076  287000.0
      4              NaN        4  0.033  287000.0
      5              NaN        4  0.054  287000.0
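I would like to know how to get the following three results:
1. jet_pt restricted to the first jet_num subentries of each entry (no NaN padding):
                  jet_pt
entry subentry
0     0         233000.0
      1          11000.0
      2         147000.0
      3          13300.0
      4         173000.0
      5          10700.0
1     0         148000.0
      1          20600.0
      2          99300.0
      3          12900.0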
I can accomplish this result with:
jet_num = df['jet_num'].groupby(level=0).max()
jet_pt = df['jet_pt'].groupby(level=0).apply(lambda x: x[:jet_num[x.name]]).droplevel(0)
but it feels inefficient to me.
2. bb_dR restricted to the first four subentries of each entry:
                bb_dR
entry subentry
0     0         0.083
      1         0.056
      2         0.098
      3         0.032
1     0         0.012
      1         0.032
      2         0.045
      3         0.076
Again, I can achieve the wanted result by doing:
df['bb_dR'].groupby(level=0).apply(lambda x: x[:4]).droplevel(0)
but I still think there is a better way.
3. hh_m with a single row per entry:
                    hh_m
entry subentry
0     0         325000.0
1     0         287000.0
I think for 3, it would also be useful to drop entry and subentry. Thanks in advance.
answer1
cond = df.index.get_level_values(1) < df['jet_num']
out1 = df.loc[cond, ['jet_pt']]
out1
                  jet_pt
entry subentry
0     0         233000.0
      1          11000.0
      2         147000.0
      3          13300.0
      4         173000.0
      5          10700.0
1     0         148000.0
      1          20600.0
      2          99300.0
      3          12900.0
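The mask in answer1 keeps a row only when its subentry number is smaller than that row's jet_num, which is what removes the NaN padding. Here is a minimal sketch of the same idea on a smaller frame; the name toy and its values are made up for illustration and are not taken from the question, and the to_numpy() calls are only there to make the position-wise comparison explicit.

```python
import pandas as pd

# Hypothetical toy frame: entry 0 has 3 real jets, entry 1 has only 1
idx = pd.MultiIndex.from_product([[0, 1], range(3)], names=["entry", "subentry"])
toy = pd.DataFrame(
    {
        "jet_pt": [10.0, 20.0, 30.0, 40.0, float("nan"), float("nan")],
        "jet_num": [3, 3, 3, 1, 1, 1],
    },
    index=idx,
)

# Compare each row's subentry number with that row's jet_num, position by position
mask = toy.index.get_level_values("subentry").to_numpy() < toy["jet_num"].to_numpy()

# Boolean row selection; the list of columns keeps the result a DataFrame
kept = toy.loc[mask, ["jet_pt"]]
```

Because the comparison is vectorized over the whole frame, no per-group Python lambda is needed.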
answer2
out2 = df.loc[(slice(None), slice(0, 3)), ['bb_dR']]
out2
                bb_dR
entry subentry
0     0         0.083
      1         0.056
      2         0.098
      3         0.032
1     0         0.012
      1         0.032
      2         0.045
      3         0.076
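As an alternative to the label slice in answer2, the first rows of each entry can probably also be taken with groupby plus head, which counts rows per group and so does not depend on the subentry labels. This is a sketch on a made-up frame (toy and its string entry labels are assumptions for illustration), not the answer's own method.

```python
import pandas as pd

# Hypothetical toy frame with two entries of six subentries each
idx = pd.MultiIndex.from_product([["a", "b"], range(6)], names=["entry", "subentry"])
toy = pd.DataFrame(
    {"bb_dR": [0.083, 0.056, 0.098, 0.032, 0.021, 0.066,
               0.012, 0.032, 0.045, 0.076, 0.033, 0.054]},
    index=idx,
)

# head(4) keeps the first four rows of every entry and preserves the original index
first4 = toy[["bb_dR"]].groupby(level="entry").head(4)
```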
answer3
out3 = df.loc[(slice(None), 0), ['hh_m']]
out3
                    hh_m
entry subentry
0     0         325000.0
1     0         287000.0
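The question also asks about dropping entry and subentry for the third case. One possible sketch, assuming a frame shaped like the one above (toy and its values are illustrative): xs selects subentry 0 and removes that level, and reset_index(drop=True) or to_numpy() then discards the remaining entry index.

```python
import pandas as pd

# Hypothetical toy frame: hh_m is repeated on every subentry of an entry
idx = pd.MultiIndex.from_product([[0, 1], range(3)], names=["entry", "subentry"])
toy = pd.DataFrame({"hh_m": [325000.0] * 3 + [287000.0] * 3}, index=idx)

# Cross-section at subentry 0; the subentry level is dropped, entry remains
per_entry = toy.xs(0, level="subentry")["hh_m"]

# Drop the entry index as well, leaving a plain 0..n-1 index
flat = per_entry.reset_index(drop=True)

# Or keep only the raw values
values = per_entry.to_numpy()
```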
If the second level of your MultiIndex does not hold consecutive integer positions like 0, 1, ..., use groupby + cumcount to number the rows within each entry. For answer1, using cumcount gives the following code:
cond = df.groupby(level=0).cumcount() < df['jet_num']
out1 = df.loc[cond, ['jet_pt']]
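To see why cumcount helps, here is a small sketch where the second index level holds strings, so comparing the labels with jet_num would not make sense. The frame toy and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame whose second index level is not integer-valued
idx = pd.MultiIndex.from_tuples(
    [("e0", "x"), ("e0", "y"), ("e0", "z"), ("e1", "x"), ("e1", "y"), ("e1", "z")],
    names=["entry", "subentry"],
)
toy = pd.DataFrame(
    {
        "jet_pt": [1.0, 2.0, 3.0, 4.0, float("nan"), float("nan")],
        "jet_num": [3, 3, 3, 1, 1, 1],
    },
    index=idx,
)

# cumcount numbers the rows within each entry: 0, 1, 2, 0, 1, 2
pos = toy.groupby(level="entry").cumcount()

# Both sides share the same index, so the comparison aligns row by row
cond = pos < toy["jet_num"]
out = toy.loc[cond, ["jet_pt"]]
```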
Example Code
import pandas as pd
nan = float('nan')
# Two entries with six subentries each, named like the ak.to_dataframe output
index = pd.MultiIndex.from_product([[0, 1], range(6)], names=['entry', 'subentry'])
df = pd.DataFrame({
    'jet_pt': [233000.0, 11000.0, 147000.0, 13300.0, 173000.0, 10700.0,
               148000.0, 20600.0, 99300.0, 12900.0, nan, nan],
    'jet_num': [6] * 6 + [4] * 6,
    'bb_dR': [0.083, 0.056, 0.098, 0.032, 0.021, 0.066,
              0.012, 0.032, 0.045, 0.076, 0.033, 0.054],
    'hh_m': [325000.0] * 6 + [287000.0] * 6,
}, index=index)