
NLTK discourse tree to edge list


I have the following string:

dt = ' ( NS-elaboration ( EDU 1 )  ( NS-elaboration ( EDU 2 )  ( NS-elaboration ( EDU 3 )  ( EDU 4 )  )  )  ) '

I can convert it to an NLTK tree as follows:

from nltk import Tree
t = Tree.fromstring(dt)

This tree is illustrated in this link.

What I want is the edge list of this tree. Something similar to the following:

NS-elaboration0    EDU1
NS-elaboration0    NS-elaboration1
NS-elaboration1    EDU2
NS-elaboration1    NS-elaboration2
NS-elaboration2    EDU3
NS-elaboration2    EDU4

where the number after NS-elaboration is the depth of that node in the tree (0 for the root).


Solution

  • I tried to find a built-in for this, but in the end I wrote the following algorithm myself:

    Code:

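    from collections import deque

    from nltk import Tree

    def get_edges(tree, depth):
        # Name the parent by its label plus its depth in the tree.
        from_str = f"{tree.label()}{depth}"
        # Pre-terminals such as (EDU 1) are named by label plus their leaf.
        children = [f"{child.label()}{child.leaves()[0]}" for child in tree if isinstance(child, Tree) and child.height() == 2]
        # Deeper subtrees sit one level below the parent.
        children.extend([f"{child.label()}{depth + 1}" for child in tree if isinstance(child, Tree) and child.height() > 2])
        return [(from_str, child) for child in children]

    def tree_to_edges(tree):
        rv = []
        # Breadth-first queue of (subtree, depth). Storing the depth with each
        # subtree keeps names consistent when a node has several internal
        # children; a single running counter would number siblings differently.
        to_check = deque([(tree, 0)])
        while to_check:
            tree_to_check, depth = to_check.popleft()
            rv.extend(get_edges(tree_to_check, depth))
            to_check.extend((child, depth + 1) for child in tree_to_check if isinstance(child, Tree) and child.height() > 2)
        return rv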
    

    Usage:

    >>> dt = ' ( NS-elaboration ( EDU 1 )  ( NS-elaboration ( EDU 2 )  ( NS-elaboration ( EDU 3 )  ( EDU 4 )  )  )  ) '
    >>> t = Tree.fromstring(dt)
    >>> tree_to_edges(t)
    [('NS-elaboration0', 'EDU1'),
     ('NS-elaboration0', 'NS-elaboration1'),
     ('NS-elaboration1', 'EDU2'),
     ('NS-elaboration1', 'NS-elaboration2'),
     ('NS-elaboration2', 'EDU3'),
     ('NS-elaboration2', 'EDU4')]
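
One caveat with naming nodes by label plus depth: if a tree has two internal nodes with the same label at the same depth, they end up with the same name, and their edges become indistinguishable. A possible workaround is to name internal nodes by their tree position instead. The sketch below is only an illustration: `tree_to_unique_edges` and the `label@position` naming are hypothetical choices, not part of NLTK. It relies on `Tree.treepositions()` and on indexing a tree with a position tuple.

```python
from nltk import Tree

def tree_to_unique_edges(tree):
    """Return (parent, child) edges with a distinct name per internal node.

    Hypothetical helper, not an NLTK function. Internal nodes are named
    "<label>@<position>" (the root is "<label>@root"); pre-terminals such
    as (EDU 1) keep the "<label><leaf>" form used in the answer above.
    """
    def name(pos):
        node = tree[pos]  # the empty tuple () indexes the root itself
        if node.height() == 2:  # pre-terminal, e.g. (EDU 1)
            return f"{node.label()}{node.leaves()[0]}"
        path = ".".join(map(str, pos)) or "root"
        return f"{node.label()}@{path}"

    edges = []
    # treepositions() walks the tree in preorder and also yields leaf
    # positions; leaves are plain strings here, so they are skipped.
    for pos in tree.treepositions():
        if pos and isinstance(tree[pos], Tree):
            edges.append((name(pos[:-1]), name(pos)))
    return edges
```

For the string in the question this yields the same six edges, with the internal nodes named `NS-elaboration@root`, `NS-elaboration@1` and `NS-elaboration@1.1`. Edges come out in preorder rather than breadth-first order.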