python physics uproot

Continuing to read tree data with compression issue with uproot


When reading data with uproot from a tree compressed with zlib, I get decompression errors from zlib, such as "Error -3 while decompressing data: incorrect data check" or "Error -5 while decompressing data: incomplete or truncated stream". When I open the file in ROOT, I get a similar error from zlib:

R__unzip: error -3 in inflate (zlib)
Error in <TBasket::ReadBasketBuffers>: fNbytes = 20102, fKeylen = 199, fObjlen = 28540, noutot = 0, nout=0, nin=19903, nbuf=28540
Error in <TBranchElement::GetBasket>: File: Stage_1_files/AnalysisResults.31.root at byte:51212830, branch:data.fJetConstituents.fPt, entry:133851, badread=1, nerrors=1, basketnumber=189
...

However, ROOT skips over the problematic entry (or entries) and continues trying to read the file. In uproot, the zlib exception is passed up. I catch it, but I'm unable to continue processing the file. There are clearly underlying issues with the file (apparently from problems with ROOT merging, which are out of my control), but is there a way to have uproot identify and skip problematic entries and continue with the rest of the data? I could imagine restricting the entries when reading, but how would I identify them with uproot without trial and error? So far I could only identify the problematic branch by reading each branch one by one in uproot (or by checking with ROOT), and that still doesn't identify which entries are the issue.

Thanks!


Solution

  • Data in TTrees are compressed by baskets, so if a basket's compression is corrupted, no data can be read from that basket but all the other baskets are potentially fine.

    Uproot's array-reading functions give up if any baskets raise an error, but you can use the more low-level TBranch.basket method to read baskets one by one, catching any exceptions along the way. Get a TBranch object from the TTree with dict-like access (e.g. mytree["branch_name"]) and call basket(i, ...) with the same arguments you'd pass to TTree.array but additionally with the basket number i. (They start at 0 and go up to but not including TBranch.numbaskets.)

    There's also a TBranch.iterate_baskets, but it won't help you here because it would stop iterating when it hits an exception. You need to control the loop over baskets yourself so that you can wrap each read in try-except logic.

    There's one more issue: you might need to correlate data from different branches, and their baskets might not begin and end at the same entry numbers. If you ask for TTree.clusters(branches_list) with the branches you're interested in, it will give you entry start and stop numbers at basket boundaries that are common to the set of branches you provide. Using these entry numbers as entrystart and entrystop in the normal TTree.arrays method would read only the baskets covering that entry range, and you can put try-except logic around that.
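Both approaches above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it assumes the uproot 3 names used in this answer (TBranch.numbaskets, TBranch.basket, TTree.clusters, and TTree.arrays with entrystart/entrystop), and it assumes the corruption always surfaces as zlib.error. The helper names read_good_baskets and read_good_clusters are made up for illustration, and the file and tree names in the usage comment are placeholders.

```python
import zlib


def read_good_baskets(branch):
    """Read each basket of one branch separately, skipping corrupted ones.

    `branch` is expected to behave like an uproot 3 TBranch, i.e. to provide
    `numbaskets` and `basket(i)` (assumption based on the answer above).
    Returns (good, bad): basket number -> array, basket number -> exception.
    """
    good = {}
    bad = {}
    for i in range(branch.numbaskets):
        try:
            good[i] = branch.basket(i)
        except zlib.error as err:
            # Only zlib failures are caught here; widen the except clause
            # if the file fails in other ways.
            bad[i] = err
    return good, bad


def read_good_clusters(tree, branches):
    """Read several branches together, one common entry range at a time.

    `tree` is expected to behave like an uproot 3 TTree, i.e. to provide
    `clusters(branches)` yielding (start, stop) pairs and
    `arrays(branches, entrystart=..., entrystop=...)`.
    Returns (chunks, bad_ranges): a list of (start, stop, arrays) for the
    ranges that could be read, and a list of (start, stop) for those that failed.
    """
    chunks = []
    bad_ranges = []
    for start, stop in tree.clusters(branches):
        try:
            arrays = tree.arrays(branches, entrystart=start, entrystop=stop)
        except zlib.error:
            bad_ranges.append((start, stop))
        else:
            chunks.append((start, stop, arrays))
    return chunks, bad_ranges


# Usage sketch (placeholder file and tree names):
#   import uproot
#   tree = uproot.open("some_file.root")["some_tree"]
#   good, bad = read_good_baskets(tree["data.fJetConstituents.fPt"])
#   chunks, bad_ranges = read_good_clusters(tree, ["data.fJetConstituents.fPt"])
```

The second helper also addresses the original question of which entries are affected: bad_ranges lists the entry start and stop numbers of every range that could not be decompressed, so those ranges can be excluded in later reads.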