Using the same counter branch for multiple array branches

I'm trying to output a TTree with the same general format as an input TTree which has the structure:

ttree.show(datatypes.keys())

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Weight               | float                    | AsDtype('>f4')
E_Beam               | float                    | AsDtype('>f4')
Px_Beam              | float                    | AsDtype('>f4')
Py_Beam              | float                    | AsDtype('>f4')
Pz_Beam              | float                    | AsDtype('>f4')
NumFinalState        | int32_t                  | AsDtype('>i4')
E_FinalState         | float[]                  | AsJagged(AsDtype('>f4'))
Px_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
Py_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
Pz_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))

The NumFinalState branch contains the number of elements in all of the *_FinalState array branches. This will always be the case in my work, so it seems wasteful to do the following:

outfile = uproot.recreate("myData_OUT.root")
datatypes = {"Weight": "float32", "E_Beam": "float32", "Px_Beam": "float32", "Py_Beam": "float32", "Pz_Beam": "float32", "NumFinalState": "int32", "E_FinalState": "var * float32", "Px_FinalState": "var * float32", "Py_FinalState": "var * float32", "Pz_FinalState": "var * float32"}
outfile.mktree("kin", datatypes)
outfile["kin"].show()

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Weight               | float                    | AsDtype('>f4')
E_Beam               | float                    | AsDtype('>f4')
Px_Beam              | float                    | AsDtype('>f4')
Py_Beam              | float                    | AsDtype('>f4')
Pz_Beam              | float                    | AsDtype('>f4')
NumFinalState        | int32_t                  | AsDtype('>i4')
nE_FinalState        | int32_t                  | AsDtype('>i4')
E_FinalState         | float[]                  | AsJagged(AsDtype('>f4'))
nPx_FinalState       | int32_t                  | AsDtype('>i4')
Px_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
nPy_FinalState       | int32_t                  | AsDtype('>i4')
Py_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
nPz_FinalState       | int32_t                  | AsDtype('>i4')
Pz_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))

In the documentation, it appears I can use the counter_name argument in mktree to give the counter branches custom names, but it seems to run into trouble if I try to give them the same name:

outfile = uproot.recreate("myData_OUT.root")
datatypes = {"Weight": "float32", "E_Beam": "float32", "Px_Beam": "float32", "Py_Beam": "float32", "Pz_Beam": "float32", "NumFinalState": "int32", "E_FinalState": "var * float32", "Px_FinalState": "var * float32", "Py_FinalState": "var * float32", "Pz_FinalState": "var * float32"}
def counter_name(in_str: str) -> str:
    if "FinalState" in in_str:
        return "NumFinalState"
    return f"n{in_str}"
outfile.mktree("kin", datatypes, counter_name=counter_name)

This code throws an error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
/var/folders/td/j379rd296477649qvl1k8n180000gn/T/ipykernel_24226/1447106219.py in <module>
      5         return "NumFinalState"
      6     return f"n{in_str}"
----> 7 outfile.mktree("kin", datatypes, counter_name=counter_name)

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/writable.py in mktree(self, name, branch_types, title, counter_name, field_name, initial_basket_capacity, resize_factor)
   1268             path,
   1269             directory._file,
-> 1270             directory._cascading.add_tree(
   1271                 directory._file.sink,
   1272                 treename,

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/_cascade.py in add_tree(self, sink, name, title, branch_types, counter_name, field_name, initial_basket_capacity, resize_factor)
   1796             resize_factor,
   1797         )
-> 1798         tree.write_anew(sink)
   1799         return tree
   1800 

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/_cascadetree.py in write_anew(self, sink)
   1114                 # reference to fLeafCount
   1115                 out.append(
-> 1116                     uproot.deserialization._read_object_any_format1.pack(
   1117                         datum["counter"]["tleaf_reference_number"]
   1118                     )

error: required argument is not an integer

and I figure this error is related to uproot trying to make two branches with the same name. Is there any way to get around this? It'll probably be okay to just create a NumFinalState branch manually, since it gets read in by a subsequent program, but just in terms of compactness, it would be nice to not create a bunch of unnecessary branches.

Solution

Uproot makes one counter branch for each Awkward Array in the dict it's given. Since your Awkward Arrays are arrays of lists of numbers, they're all presumed to have different counters. There isn't a way to manually force them to share a counter; the way it's supposed to work is to join them all into one Awkward Array, which Uproot will recognize as something that should have one counter.

So suppose you have

>>> import awkward as ak
>>> import uproot
>>> E_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Px_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Py_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Pz_FinalState = ak.Array([[11, 22, 33], [], [44, 55]])

Passing each of these individually into the output TTree will make an nE_FinalState, nPx_FinalState, etc., as you've seen. So make them one array with ak.zip:

>>> finalstate = ak.zip({"E": E_FinalState, "px": Px_FinalState, "py": Py_FinalState, "pz": Pz_FinalState})
>>> finalstate
<Array [[{E: 1.1, px: 1.1, ... pz: 55}]] type='3 * var * {"E": float64, "px": fl...'>
>>> print(finalstate.type)
3 * var * {"E": float64, "px": float64, "py": float64, "pz": int64}

The key thing is that the type is now number of entries * var * {record}, rather than number of entries * var * float64 individually for each array. It's in the ak.zip function that you find out whether the lists really do have the same number of items in every entry. (It's a one-time cross-check; the creation of the finalstate array is itself zero-copy.)

Now you can use this when writing a TTree:

>>> outfile = uproot.recreate("myData_OUT.root")
>>> outfile["kin"] = {"finalstate": finalstate}
>>> outfile["kin"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nfinalstate          | int32_t                  | AsDtype('>i4')
finalstate_E         | double[]                 | AsJagged(AsDtype('>f8'))
finalstate_px        | double[]                 | AsJagged(AsDtype('>f8'))
finalstate_py        | double[]                 | AsJagged(AsDtype('>f8'))
finalstate_pz        | int64_t[]                | AsJagged(AsDtype('>i8'))

The counter_name and field_name arguments only control the generation of names. By default, they follow a convention in which the "finalstate" name in the dict is prefixed with "n" for the counter and suffixed with "_" and the name of each field for the branches (CMS NanoAOD conventions). Those arguments exist so that you can apply a different naming convention, but they don't change which counter branches get created. In fact, defining these functions so that two counters get the same name can trigger a confusing error message like the one you saw; I don't think it's an explicitly checked case.
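To make the naming rule concrete, here is a sketch in plain Python with no Uproot calls. The two default functions are written as I understand Uproot's defaults, so treat them as an assumption. The custom pair and the branch_names helper are hypothetical, chosen so that a dict key "FinalState" with fields "E", "Px", "Py", "Pz" reproduces the branch names of your input file:

```python
# Assumed to mirror Uproot's default naming (CMS NanoAOD style).
def default_counter_name(counted):
    return "n" + counted


def default_field_name(outer, inner):
    return inner if outer == "" else outer + "_" + inner


# Hypothetical custom convention: "NumFinalState", "E_FinalState", ...
def custom_counter_name(counted):
    return "Num" + counted


def custom_field_name(outer, inner):
    return inner + "_" + outer


def branch_names(key, fields, counter_name, field_name):
    """List the names one zipped array would get: counter first, then fields."""
    return [counter_name(key)] + [field_name(key, field) for field in fields]


default_names = branch_names(
    "finalstate", ["E", "px", "py", "pz"], default_counter_name, default_field_name
)
custom_names = branch_names(
    "FinalState", ["E", "Px", "Py", "Pz"], custom_counter_name, custom_field_name
)
```

Functions like custom_counter_name and custom_field_name are what you would pass as counter_name= and field_name=. They rename the single counter that the zipped array produces; they do not merge counters of separate arrays.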

You should also be able to use the uproot.WritableDirectory.mktree constructor, which makes a TTree without data, so that every uproot.WritableTree.extend call, including the first, looks the same. The dict syntax would be

>>> outfile.mktree("kin", {"finalstate": finalstate.type})
<WritableTree '/kin' at 0x7f48ff139550>
>>> outfile["kin"].extend({"finalstate": finalstate})

That is, pass finalstate.type rather than the finalstate array itself.