python uproot

Getting i-th element of branch with uproot


I am using uproot to read ROOT files in Python. This is how I currently read my files:

import uproot

ifile = uproot.open(path_to_root_file)

metadata = ifile['Metadata']
waveforms = ifile['Waveforms']

waveforms.show()
waveforms_of_event_50 = waveforms['voltages'].array()[50]
print(waveforms_of_event_50)

and I get this output:

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
event                | int32_t                  | AsDtype('>i4')
producer             | std::string              | AsStrings()
voltages             | std::vector<std::vect... | AsObjects(AsVector(True, As...
[[0.00647, 0.00647, 0.00671, 0.00647, ..., 0.00769, 0.00769, 0.00647], ...]

Since waveforms['voltages'] is an array of arrays of waveforms, it is heavy, so the line waveforms_of_event_50 = waveforms['voltages'].array()[50] takes a long time: it has to load all the waveforms into memory only to discard all but the 50th. Moreover, for some of the files this is not even possible because they simply don't fit in memory. What I want instead is to get the i-th waveform without loading all of them into memory, which I understand is one of the things ROOT files are good for, i.e. something like waveforms['voltages'].get_element_number(50). Is this possible? How?


Solution

  • I could point to my answer here, but StackOverflow likes to have the content locally (in case we ever delete that GitHub repo or anything). Here's a copy of what I said there:

    The best you can do is

    waveforms_of_event_50 = waveforms['voltages'].array(entry_start=50, entry_stop=51)[0]
    

    This will read the minimum physically possible, which is one TBasket. The TBasket might be several kB or maybe a few MB of data. Data in TTrees are not stored at a finer granularity than this, and each TBasket is generally compressed, so you have to read the whole thing to decompress it. It will definitely solve your problem with tens of GB, though: I don't think it's possible for a single TBasket to get that large.

    This is not a good pattern if, right after this, you want to read the waveform of event 51, because it's probably in the same TBasket that you just read, and

    waveforms_of_event_51 = waveforms['voltages'].array(entry_start=51, entry_stop=52)[0]
    

    would read it again. If you want to load just one TBasket at a time, see TBranch.basket_entry_start_stop to know where to put your entry_start, entry_stop boundaries.
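
As a rough sketch of the "one TBasket at a time" idea, the helper below keeps the most recently read TBasket in memory and only goes back to the file when the requested entry falls outside it. The names basket_bounds and BasketReader are hypothetical and are not part of uproot. The sketch assumes the branch exposes an entry_offsets property listing the TBasket boundaries (0, ..., number of entries), plus the array(entry_start=..., entry_stop=...) call shown above.

```python
import bisect


def basket_bounds(entry_offsets, entry):
    """Return (start, stop) of the basket containing `entry`.

    `entry_offsets` is assumed to be a sorted list of basket boundaries,
    [0, n1, n2, ..., num_entries], as given by `branch.entry_offsets`.
    """
    if not 0 <= entry < entry_offsets[-1]:
        raise IndexError(f"entry {entry} out of range")
    # Index of the last boundary that is <= entry.
    i = bisect.bisect_right(entry_offsets, entry) - 1
    return entry_offsets[i], entry_offsets[i + 1]


class BasketReader:
    """Hypothetical helper: caches the last basket-sized chunk that was read."""

    def __init__(self, branch):
        self.branch = branch
        self.offsets = list(branch.entry_offsets)
        self.start = self.stop = 0
        self.chunk = None  # nothing read yet

    def __getitem__(self, entry):
        # Only touch the file if `entry` is outside the cached chunk.
        if self.chunk is None or not self.start <= entry < self.stop:
            self.start, self.stop = basket_bounds(self.offsets, entry)
            self.chunk = self.branch.array(
                entry_start=self.start, entry_stop=self.stop
            )
        return self.chunk[entry - self.start]
```

With reader = BasketReader(waveforms['voltages']), asking for reader[50] and then reader[51] should trigger a single read if both entries live in the same TBasket. Only one chunk is held at a time, so memory use stays bounded by the size of the largest TBasket.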