uproot

How to get uproot.iterate() output identical to the root_numpy root2array() output with jagged arrays


I know that there is a solution for the similar question: How to get uproot.iterate() output like the root_numpy root2array() output fast But as I understand it is suitable for the flat ROOT TTrees only. I want to have generalized solution for:

  1. fixed-size dimension but nested data like particle momentum (px, py, pz) which are represented in the ROOT TTree as the vector<double>
  2. arbitrary-size dimension data

All my attempts to apply asjagged for it failed. Is it possible to avoid jaggedarray for case (1)?


Solution

  • If the data are fixed-size but stored as vector<double>, then they're treated as though they were not fixed-size. Uproot would always read them as jagged arrays, and therefore the asarray method described in the other question is unavailable.

    That said, if you have more knowledge than your file's metadata and are willing to try an unsupported hack, you can force the interpretation of your vector<double> to always have three elements. Here's an example—first, we need a suitable ROOT file:

    TFile *f = new TFile("tmp.root", "recreate");
    TTree *t = new TTree("t", "t");
    std::vector<double> x;
    t->Branch("x", &x);
    x = {101, 102, 103};
    t->Fill();
    x = {201, 202, 203};
    t->Fill();
    x = {301, 302, 303};
    t->Fill();
    x = {401, 402, 403};
    t->Fill();
    x = {501, 502, 503};
    t->Fill();
    x = {601, 602, 603};
    t->Fill();
    t->Write();
    f->Close();
    

    Now, the normal way to read this in Uproot is with a JaggedArray:

    >>> import uproot
    >>> branch = uproot.open("tmp.root")["t"]["x"]
    >>> branch.array()
    <JaggedArray [[101.0 102.0 103.0] [201.0 202.0 203.0] [301.0 302.0 303.0] [401.0 402.0 403.0] [501.0 502.0 503.0] [601.0 602.0 603.0]] at 0x7f325ab4eb90>
    >>> branch.interpretation
    asjagged(asdtype('>f8'), 10)
    

    But this allocates new arrays every time you read it and does some manipulation to find the borders, which you know are regular. (And Uproot does not know that.)

    The header for an STL vector happens to be 10 bytes long. After that, its contents are serialized in order, from first to last, before moving on to the next STL vector. For three 8-byte floating point numbers, that's 10 + 8 + 8 + 8 = 34 bytes, always the same size. The fact that it always happens to be the same size is critical for the following.

    A NumPy structured array can represent arrays of fixed-size structs as a dtype:

    >>> array = branch.array(uproot.asdtype([("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")]))
    >>> array
    array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
          dtype=[('header', 'S10'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
    

    We don't like looking at the header, so we can strip it out using ordinary NumPy slicing:

    >>> array[["x", "y", "z"]]
    array([(101., 102., 103.), (201., 202., 203.), (301., 302., 303.),
           (401., 402., 403.), (501., 502., 503.), (601., 602., 603.)],
          dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[10,18,26], 'itemsize':34})
    

    (or just ask for array["x"] to get a non-structured array). Since we can do that with a plain uproot.asdtype, we can do it with a preallocated uproot.asarray.

    >>> import numpy as np
    >>> big = np.empty(1000, dtype=[("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")])
    >>> branch.array(uproot.asarray(big.dtype, big))
    array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
           (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
          dtype=[('header', 'S10'), ('x', '>f8'), ('y', '>f8'), ('z', '>f8')])
    

    Above, big has to be at least as large as the dataset you want to read and reading it returns a trimmed view of that array. It does not allocate a new array—the data are stored in big:

    >>> big[["x", "y", "z"]][:10]
    array([( 1.01000000e+002,  1.02000000e+002,  1.03000000e+002),
           ( 2.01000000e+002,  2.02000000e+002,  2.03000000e+002),
           ( 3.01000000e+002,  3.02000000e+002,  3.03000000e+002),
           ( 4.01000000e+002,  4.02000000e+002,  4.03000000e+002),
           ( 5.01000000e+002,  5.02000000e+002,  5.03000000e+002),
           ( 6.01000000e+002,  6.02000000e+002,  6.03000000e+002),
           ( 1.22164945e-309,  5.26335088e-310,  1.22167067e-309),
           (-5.70498984e+158,  5.97958175e+158, -5.97958175e+158),
           (-4.92033505e+032, -4.92033505e+032, -4.92033505e+032),
           ( 3.77957352e+011,  3.77957221e+011,  3.77957320e+011)],
          dtype={'names':['x','y','z'], 'formats':['>f8','>f8','>f8'], 'offsets':[10,18,26], 'itemsize':34})
    

    Everything beyond the 6 events that were read is uninitialized junk, so I'd recommend using the trimmed output from the branch.array function; this is just to show that big is being filled—you're not getting a new array.

    As per your question (2), if the data are not regular (in number of bytes per entry), then there's no way you can do this. Also, remember that the above technique is not officially supported: you have to know that your data are regular and you have to know that STL vectors have a 10-byte header, which are not things we expect most users to know.