algorithmparsingnlpparse-treecyk

Steps to generate parse tree from CYK Algorithm (Natural Language Processing)


I am currently working on a project involving NLP. I have implemented a CKY identifier as given in Jurafsky and Martin (algorithm on page 450). The table so produced actually stores the nonterminals in the table (instead of the usual boolean values). However, the only issue I am getting is to retrieve the parse tree.

Here is an illustration of what my CKY identifier does:

This is my grammar

          S -> NP VP 
          S -> VP 
          NP -> MODAL PRON | DET NP | NOUN VF | NOUN | DET NOUN | DET FILENAME
          MODAL -> 'MD'
          PRON -> 'PPSS' | 'PPO'
          VP -> VERB NP
          VP -> VERB VP
          VP -> ADVERB VP
          VP -> VF
          VERB -> 'VB' | 'VBN'
          NOUN -> 'NN' | 'NP'
          VF -> VERB FILENAME
          FILENAME -> 'NN' | 'NP'
          ADVERB -> 'RB'
          DET -> 'AT'

And this is the algorithm:

for j  from i to LENGTH(words) do
    table[j-1,j] = A where A -> POS(word[j])
    for i from j-2 downto 0
        for k from i+1 to j-1
            table[i,j] = Union(table[i,j], A such that A->BC)
            where B is in table[i,k] and C is in table[k,j]

And this is what my parsing table looks like after being filled:

CKY Table filled as per algorithm mentioned

Now that I know that since S resides in [0,5], the string has been parsed, and that for k = 1 (as per the algorithm given in Martin and Jurafsky), we have S -> table[0][2] table[2][5] i.e. S -> NP VP

The only issue I am getting is that I have been able to retrieve the rules used, but then they are in a jumbled format, i.e. not on the basis of their appearance in parse tree. Can someone suggest an algorithm to retrieve the correct parse tree?

Thankyou.


Solution

  • You should visit recursively the cells of your table and unfold them in the same way you did for the S node until everything is a terminal (so you don't have anything else to unfold). In your example, you first go to cell [0][2]; this is a terminal, you don't have to do anything. Next you go to [2][5], this is a non-terminal made by [2][3] and [3][5]. You visit [2][3], it's a terminal. [3][5] is a non-terminal, made by two terminals. You are done. Here is a demo in Python:

    class Node:
        '''Think this as a cell in your table'''
        def __init__(self, left, right, type, word):
            self.left = left
            self.right = right
            self.type = type
            self.word = word
    
    # Declare terminals
    t1 = Node(None,None,'MOD','can')
    t2 = Node(None,None,'PRON','you')
    t3 = Node(None,None,'VERB', 'eat')
    t4 = Node(None,None,'DET', 'a')
    t5 = Node(None,None,'NOUN','flower')
    
    # Declare non-terminals
    nt1 = Node(t1,t2, 'NP', None)
    nt2 = Node(t4,t5, 'NP', None)
    nt3 = Node(t3,nt2,'VP', None)
    nt4 = Node(nt1,nt3,'S', None)
    
    def unfold(node):
        # Check for a terminal
        if node.left == None and node.right == None:
            return node.word+"_"+node.type
    
        return "["+unfold(node.left)+" "+unfold(node.right)+"]_"+node.type
    
    print unfold(nt4)
    

    And the output:

    [[can_MOD you_PRON]_NP [eat_VERB [a_DET flower_NOUN]_NP]_VP]_S