pandasgroup-byawkward-array

Slicing MultiIndex pandas Dataframe


I am working with awkward arrays and dumping information to pandas dataframe with multiindex:

>>> import awkward as ak
>>> import pandas as pd

>>> ak_arr = ak.Array([
...     {
...       'jet_pt': [2.33e+05, 1.1e+04, 1.47e+05, 1.33e+04, 1.73e+05, 1.07e+04], 
...       'jet_num': 6, 
...       'bb_dR': [0.83e-01, 0.56e-01, 0.98e-01, 0.32e-01, 0.21e-01, 0.66e-01], 
...       'hh_m': 3.25e+05
...     }, 
...     {
...       'jet_pt': [1.48e+05, 2.06e+04, 9.93e+04, 1.29e+04], 
...       'jet_num': 4, 
...       'bb_dR': [0.12e-1, 0.32e-01, 0.45e-01, 0.76e-01, 0.33e-01, 0.54e-01], 
...       'hh_m': 2.87e+05
...     }
... ])
>>> ak_arr
<Array [{jet_pt: [...], ...}, {...}] type='2 * {jet_pt: var * float64, jet_...'>

>>> df = ak.to_dataframe(ak_arr, how='outer')
>>> df
                  jet_pt  jet_num  bb_dR      hh_m
entry subentry                                    
0     0         233000.0      6    0.083  325000.0
      1          11000.0      6    0.056  325000.0
      2         147000.0      6    0.098  325000.0
      3          13300.0      6    0.032  325000.0
      4         173000.0      6    0.021  325000.0
      5          10700.0      6    0.066  325000.0
1     0         148000.0      4    0.012  287000.0
      1          20600.0      4    0.032  287000.0
      2          99300.0      4    0.045  287000.0
      3          12900.0      4    0.076  287000.0
      4              NaN      4    0.033  287000.0
      5              NaN      4    0.054  287000.0

I would like to know:

  1. How can I select jet_pt using the column jet_num to get the number of jets in each entry, so if jet_num is 6, then jet_t will have 6 subentries and so on? Wanted result:
                  jet_pt
entry subentry                                    
0     0         233000.0
      1          11000.0
      2         147000.0
      3          13300.0
      4         173000.0
      5          10700.0
1     0         148000.0
      1          20600.0
      2          99300.0
      3          12900.0

I can accomplish this result with:

jet_num = df['jet_num'].max(level=0)
jet_z = df['jet_z'].groupby(level=0).apply(lambda x: x[:jet_num[x.name]]).droplevel(0)

but it feels inefficient to me.

  1. How can I select the column bb_dR but only the first 4th elements in subentry? Wanted result:
                bb_dR
entry subentry                                    
0     0         0.083
      1         0.056
      2         0.098
      3         0.032
1     0         0.012
      1         0.032
      2         0.045
      3         0.076

Again, I can achieve the wanted result by doing:

df['bb_dR'].groupby(level=0).apply(lambda x: x[:4]).droplevel(0)

but still think there is a better way.

  1. Similar to 2, how can I select just the first row in entry and subentry? Wanted result:
                  hh_m
entry subentry                                    
0     0           325000.0
1     0           287000.0

I think for 3, it would also be useful to drop entry and subentry. Thanks in advance.


Solution

  • answer1

    cond = df.index.get_level_values(1) < df['jet_num']
    out1 = df.loc[cond, ['jet_pt']]
    

    out1

                      jet_pt
    entry subentry          
    0     0         233000.0
          1          11000.0
          2         147000.0
          3          13300.0
          4         173000.0
          5          10700.0
    1     0         148000.0
          1          20600.0
          2          99300.0
          3          12900.0
    

    answer2

    out2 = df.loc[(slice(None), slice(0, 3)), ['bb_dR']]
    

    out2

                    bb_dR
    entry subentry       
    0     0         0.083
          1         0.056
          2         0.098
          3         0.032
    1     0         0.012
          1         0.032
          2         0.045
          3         0.076
    

    answer3

    out3 = df.loc[(slice(None), 0), ['hh_m']]
    

    out3

                        hh_m
    entry subentry          
    0     0         325000.0
    1     0         287000.0
    

    If your multi-index does not have integer locations like 0, 1, use groupby + cumcount. In the case of answer1, using cumcount results in the following code:

    cond = df.groupby(level=0).cumcount() < df['jet_num']
    out1 = df.loc[cond, ['jet_pt']]
    

    Example Code

    import pandas as pd
    nan = float('nan')
    df = pd.DataFrame({'jet_pt': {(0, 0): 233000.0, (0, 1): 11000.0, (0, 2): 147000.0, (0, 3): 13300.0, (0, 4): 173000.0, (0, 5): 10700.0, (1, 0): 148000.0, (1, 1): 20600.0, (1, 2): 99300.0, (1, 3): 12900.0, (1, 4): nan, (1, 5): nan}, 'jet_num': {(0, 0): 6, (0, 1): 6, (0, 2): 6, (0, 3): 6, (0, 4): 6, (0, 5): 6, (1, 0): 4, (1, 1): 4, (1, 2): 4, (1, 3): 4, (1, 4): 4, (1, 5): 4}, 'bb_dR': {(0, 0): 0.083, (0, 1): 0.056, (0, 2): 0.098, (0, 3): 0.032, (0, 4): 0.021, (0, 5): 0.066, (1, 0): 0.012, (1, 1): 0.032, (1, 2): 0.045, (1, 3): 0.076, (1, 4): 0.033, (1, 5): 0.054}, 'hh_m': {(0, 0): 325000.0, (0, 1): 325000.0, (0, 2): 325000.0, (0, 3): 325000.0, (0, 4): 325000.0, (0, 5): 325000.0, (1, 0): 287000.0, (1, 1): 287000.0, (1, 2): 287000.0, (1, 3): 287000.0, (1, 4): 287000.0, (1, 5): 287000.0}}).rename_axis(['entry', 'subentry'])