python pandas anytree

Anytree to Pandas or tuple conversion with node members as indices


I'd like to build a pandas dataframe or tuple from an anytree object, where each node has a list attribute of members:

from anytree import Node, RenderTree, find_by_attr
from anytree.exporter import DictExporter
from collections import OrderedDict
import pandas as pd
import numpy as np

tree = Node('T0C0',
        n=1000,
        tier=0,
        members=['A','B','C','D'])

Node('T0C0.T1C0',
     parent=find_by_attr(tree, 'T0C0'),
     n=400,
     tier=1,
     members=['B','C'])

Node('T0C0.T1C1',
     parent=find_by_attr(tree, 'T0C0'),
     n=600,
     tier=1,
     members=['A','D'])

Node('T0C0.T1C1.T2C0',
     parent=find_by_attr(tree, 'T0C0.T1C1'),
     n=300,
     tier=2,
     members=['D'])

Node('T0C0.T1C1.T2C1',
     parent=find_by_attr(tree, 'T0C0.T1C1'),
     n=300,
     tier=2,
     members=['A'])

My goal is to produce a dataframe of end nodes per member or, even better, one column of tier membership per tier, like the following:

# Each inner list below is one column, so transpose to get one row per member.
pd.DataFrame(data=np.array([['T0C0.T1C1.T2C1','T0C0.T1C0','T0C0.T1C0','T0C0.T1C1.T2C0'],
                           ['T0C0','T0C0','T0C0','T0C0'],
                           ['T0C0.T1C1','T0C0.T1C0','T0C0.T1C0','T0C0.T1C1'],
                           ['T0C0.T1C1.T2C1',None,None,'T0C0.T1C1.T2C0']]
                          ).T,
             index=['A','B','C','D'],columns=['EndCluster','tier0','tier1','tier2'])

I've tried exporting to an OrderedDict and to JSON and building dataframes directly from there, but "children" becomes a column in the resulting dataframe, holding nested OrderedDict entries. I cannot find a way to unnest them. Thank you for any help!


Solution

  • The answer turned out to be easier than I thought. First grab all the end nodes using anytree's findall():

    import anytree  # the question only imports individual names, not the module
    endnodes = anytree.findall(tree, filter_=lambda node: len(node.children) == 0)

    This returns a tuple of nodes, which is easier to work with in this case than anytree's OrderedDict export.

    Finally, populate the dataframe by repeating each end node's attributes once per member, i.e. len(item.members) times:

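    members = []
    tier = []
    endcluster = []
    for item in endnodes:
        # One entry per member; node-level values are repeated to match.
        members += item.members
        tier += [item.tier] * len(item.members)
        endcluster += [item.name] * len(item.members)
    # Members become the index; tier and end cluster become columns.
    endf = pd.DataFrame(index=members)
    endf['tier'] = tier
    endf['endcluster'] = endcluster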