I know that there is a solution for the similar question: How to get uproot.iterate() output like the root_numpy root2array() output fast But as I understand it is suitable for the flat ROOT TTrees only. I want to have generalized solution for:
vector<double>
All my attempts to apply asjagged
for it failed. Is it possible to avoid jaggedarray
for case (1)?
If the data are fixed-size but stored as vector<double>
, then they're treated as though they were not fixed-size. Uproot would always read them as jagged arrays, and therefore the asarray
method described in the other question is unavailable.
That said, if you have more knowledge than your file's metadata and are willing to try an unsupported hack, you can force the interpretation of your vector<double>
to always have three elements. Here's an example—first, we need a suitable ROOT file:
TFile *f = new TFile("tmp.root", "recreate");
TTree *t = new TTree("t", "t");
std::vector<double> x;
t->Branch("x", &x);
x = {101, 102, 103};
t->Fill();
x = {201, 202, 203};
t->Fill();
x = {301, 302, 303};
t->Fill();
x = {401, 402, 403};
t->Fill();
x = {501, 502, 503};
t->Fill();
x = {601, 602, 603};
t->Fill();
t->Write();
f->Close();
Now, the normal way to read this in Uproot is with a JaggedArray:
>>> import uproot
>>> branch = uproot.open("tmp.root")["t"]["x"]
>>> branch.array()
<JaggedArray [[101.0 102.0 103.0] [201.0 202.0 203.0] [301.0 302.0 303.0] [401.0 402.0 403.0] [501.0 502.0 503.0] [601.0 602.0 603.0]] at 0x7f325ab4eb90>
>>> branch.interpretation
asjagged(asdtype('>f8'), 10)
But this allocates new arrays every time you read it and does some manipulation to find the borders, which you know are regular. (And Uproot does not know that.)
The header for an STL vector happens to be 10 bytes long. After that, its contents are serialized in order, from first to last, before moving on to the next STL vector. For three 8-byte floating point numbers, that's 10 + 8 + 8 + 8 = 34 bytes, always the same size. The fact that it always happens to be the same size is critical for the following.
A NumPy structured array can represent arrays of fixed-size structs as a dtype:
>>> array = branch.array(uproot.asdtype([("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")]))
>>> array
array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
dtype=[('header', 'S10'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
We don't like looking at the header, so we can strip it out using ordinary NumPy slicing:
>>> array[["x", "y", "z"]]
array([(101., 102., 103.), (201., 202., 203.), (301., 302., 303.),
(401., 402., 403.), (501., 502., 503.), (601., 602., 603.)],
dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[10,18,26], 'itemsize':34})
(or just ask for array["x"]
to get a non-structured array). Since we can do that with a plain uproot.asdtype
, we can do it with a preallocated uproot.asarray
.
>>> import numpy as np
>>> big = np.empty(1000, dtype=[("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")])
>>> branch.array(uproot.asarray(big.dtype, big))
array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
dtype=[('header', 'S10'), ('x', '>f8'), ('y', '>f8'), ('z', '>f8')])
Above, big
has to be at least as large as the dataset you want to read and reading it returns a trimmed view of that array. It does not allocate a new array—the data are stored in big
:
>>> big[["x", "y", "z"]][:10]
array([( 1.01000000e+002, 1.02000000e+002, 1.03000000e+002),
( 2.01000000e+002, 2.02000000e+002, 2.03000000e+002),
( 3.01000000e+002, 3.02000000e+002, 3.03000000e+002),
( 4.01000000e+002, 4.02000000e+002, 4.03000000e+002),
( 5.01000000e+002, 5.02000000e+002, 5.03000000e+002),
( 6.01000000e+002, 6.02000000e+002, 6.03000000e+002),
( 1.22164945e-309, 5.26335088e-310, 1.22167067e-309),
(-5.70498984e+158, 5.97958175e+158, -5.97958175e+158),
(-4.92033505e+032, -4.92033505e+032, -4.92033505e+032),
( 3.77957352e+011, 3.77957221e+011, 3.77957320e+011)],
dtype={'names':['x','y','z'], 'formats':['>f8','>f8','>f8'], 'offsets':[10,18,26], 'itemsize':34})
Everything beyond the 6 events that were read is uninitialized junk, so I'd recommend using the trimmed output from the branch.array
function; this is just to show that big
is being filled—you're not getting a new array.
As per your question (2), if the data are not regular (in number of bytes per entry), then there's no way you can do this. Also, remember that the above technique is not officially supported: you have to know that your data are regular and you have to know that STL vectors have a 10-byte header, which are not things we expect most users to know.