python pandas uproot

Loop over a pandas.DataFrame's entries or sub-entries


I am using uproot to convert a ROOT.TTree into a pandas.DataFrame. The structure of the DataFrame is shown below. Note that ‘met’ is an entry-level variable, while the ‘mu_cells_*’ columns are sub-entry-level variables.

Now I want to create a ROOT.TH1 histogram of 'met'. I asked on the ROOT forum and was told that this can only be done by looping over the DataFrame and calling ROOT.TH1.Fill() once per entry (not once per sub-entry, to avoid multiple counting), see link. What is the best way to do this?

Similarly, how do I make a TH1 of ‘mu_cells_e’, where the loop has to run over sub-entries instead?

Best,

Yosse

                             met  mu_cells_e  mu_cells_side  mu_cells_tower
entry subentry                                                         
0     0         71755.648438  179.995682             -1               6
      1         71755.648438 -308.388519             -1               7
      2         71755.648438   15.558195             -1               8
      3         71755.648438  252.033691             -1               6
      4         71755.648438  459.172119             -1               7
...                      ...         ...            ...             ...
7107  22        26328.087891  611.708374              1               4
      23        26328.087891  -13.317616              1               6
      24        26328.087891   12.681366              1               2
      25        26328.087891   -4.776075              1               4
      26        26328.087891  -17.860764              1               6

[173410 rows x 4 columns]

Solution

  • You'll need to pull out a Series first for any further computation, because ROOT, boost-histogram, or any other tool will not know about Pandas sub-indexing. That can be done like this:

    mu_cells_side = frame.mu_cells_side.xs(0, level='subentry')
    

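    Now you can use the TH1's .FillN(len(mu_cells_side), mu_cells_side, ROOT.nullptr), boost-histogram's fill, or NumPy, since at this point it is a flat Series with one value per entry. If any of those need a true NumPy array, call mu_cells_side = np.asarray(mu_cells_side, dtype=np.float64) first; FillN in particular takes a const Double_t* in C++, so a float64 array is the safe input. For a sub-entry-level variable such as mu_cells_e, skip the xs step and use the whole column, since every row is already a distinct value. Either way, this will be much faster than trying to loop in Python.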

    A MWE in the question would have been useful. A similar DataFrame can be set up like this:

    import pandas as pd

    # Two index levels: 'entry' (event) and 'subentry' (item within the event)
    indarr = [[0, 0, 1, 1, 2, 2, 2, 3],
              [0, 1, 0, 1, 0, 1, 2, 0]]
    ind = pd.MultiIndex.from_arrays(indarr, names=['entry', 'subentry'])
    # Named 'frame' to match the xs() snippet above
    frame = pd.DataFrame({"mu_cells_side": [2, 2, 3, 3, 1, 1, 1, 8],
                          "mu_cells_tower": [1, 2, 3, 4, 5, 6, 7, 8]}, index=ind)
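
To make the entry-level versus sub-entry-level distinction concrete, here is a minimal sketch that histograms both kinds of column without any Python loop. The toy values for 'met' and 'mu_cells_e', the bin ranges, and the use of np.histogram in place of a ROOT.TH1 are all assumptions made for illustration, so that the example runs without ROOT installed. It also assumes every entry has a subentry 0 row.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the uproot layout (values are made up for illustration).
index = pd.MultiIndex.from_arrays(
    [[0, 0, 1, 1, 2, 2, 2, 3],
     [0, 1, 0, 1, 0, 1, 2, 0]],
    names=["entry", "subentry"],
)
frame = pd.DataFrame(
    {
        # Entry-level: the same value is repeated for every sub-entry.
        "met": [10.0, 10.0, 20.0, 20.0, 30.0, 30.0, 30.0, 40.0],
        # Sub-entry-level: one distinct value per row.
        "mu_cells_e": [1.5, -0.5, 2.5, 0.5, 3.5, -1.5, 4.5, 0.0],
    },
    index=index,
)

# Entry-level variable: keep one row per entry (the subentry == 0 row),
# so each event is counted exactly once.
met = frame["met"].xs(0, level="subentry").to_numpy(dtype=np.float64)

# Sub-entry-level variable: every row counts, so take the whole column.
mu_cells_e = frame["mu_cells_e"].to_numpy(dtype=np.float64)

# Vectorized histogramming; binning choices here are arbitrary.
met_counts, met_edges = np.histogram(met, bins=4, range=(0.0, 50.0))
e_counts, e_edges = np.histogram(mu_cells_e, bins=4, range=(-2.0, 6.0))
```

With ROOT available, the same two float64 arrays could be handed to the FillN call mentioned in the answer instead of np.histogram, for example hist.FillN(len(met), met, ROOT.nullptr) for a previously booked TH1 named hist (a hypothetical name).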