python parsing nlp conll

Parsing CoNLL-U missing annotation (misc)


I'm trying to parse .conll files from this GitHub repo. Here is an example of my parsing code:

import glob

from conllu import parse_incr

for filename in glob.glob('./licenses-conll-format/22-MIT/MIT_permissionCopy.conll'):
    # Open the file in a context manager so it is closed after parsing.
    with open(filename, "r", encoding="utf-8") as data_file:
        # parse_incr yields one sentence at a time.
        for tokentree in parse_incr(data_file):
            print(tokentree.serialize())

Output:

24  Permission  _   NN  NN  _   27  nsubjpass   _   _
25  is  _   VBZ VBZ _   27  auxpass _   _
26  hereby  _   RB  RB  _   27  advmod  _   _
27  granted _   VBN VBN _   11  rcmod   _   _
28  ,   _   ,   ,   _   27  punct   _   _
29  free    _   JJ  JJ  _   27  advmod  _   _
30  of  _   IN  IN  _   0   erased  _   _
31  charge  _   NN  NN  _   29  prep_of _   _

This seems to be missing some annotations (I-PERMISSION, B-PERMISSION, etc.) from the original .conll file:

24  Permission  _   NN  NN  _   27  nsubjpass   _   _   B-PERMISSION    COPY
25  is  _   VBZ VBZ _   27  auxpass _   _   I-PERMISSION
26  hereby  _   RB  RB  _   27  advmod  _   _   I-PERMISSION
27  granted _   VBN VBN _   11  rcmod   _   _   I-PERMISSION
28  ,   _   ,   ,   _   27  punct   _   _   O
29  free    _   JJ  JJ  _   27  advmod  _   _   I-PERMISSION
30  of  _   IN  IN  _   0   erased  _   _   I-PERMISSION
31  charge  _   NN  NN  _   29  prep_of _   _   I-PERMISSION
32  ,   _   ,   ,   _   27  punct   _   _   O

Any thoughts on how to get all the annotations?


Solution

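  • By default, only the ten standard CoNLL-U columns are read, so the extra annotation column is dropped. You can specify the tuple of fields yourself and name the extra column (here 'rest'):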

    fields = ('id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc', 'rest')
    for tokentree in parse_incr(data_file, fields=fields):
        print(tokentree.serialize())
    

    Output:

    24  Permission  _   NN  NN  _   27  nsubjpass   _   _   B-PERMISSION
    25  is  _   VBZ VBZ _   27  auxpass _   _   I-PERMISSION
    26  hereby  _   RB  RB  _   27  advmod  _   _   I-PERMISSION
    27  granted _   VBN VBN _   11  rcmod   _   _   I-PERMISSION
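
To see why naming an eleventh field recovers the label, here is a minimal standard-library sketch that pairs column values with field names, roughly the way a parser driven by a field tuple would. It is a simplified model, not conllu's actual implementation; `parse_line`, `DEFAULT_FIELDS` and `EXTENDED_FIELDS` are hypothetical names used only for this illustration.

```python
# Simplified model of field-based column parsing (not conllu's real code).
DEFAULT_FIELDS = ("id", "form", "lemma", "upostag", "xpostag",
                  "feats", "head", "deprel", "deps", "misc")
EXTENDED_FIELDS = DEFAULT_FIELDS + ("rest",)


def parse_line(line, fields):
    """Map tab-separated columns to field names; unnamed columns are dropped."""
    columns = line.rstrip("\n").split("\t")
    # zip() stops at the shorter sequence, so columns beyond len(fields) vanish.
    return dict(zip(fields, columns))


# Token 24 from the question, with its two extra columns.
line = "24\tPermission\t_\tNN\tNN\t_\t27\tnsubjpass\t_\t_\tB-PERMISSION\tCOPY"

default_token = parse_line(line, DEFAULT_FIELDS)
extended_token = parse_line(line, EXTENDED_FIELDS)

print("rest" in default_token)   # False
print(extended_token["rest"])    # B-PERMISSION
```

In this sketch the twelfth column (COPY) is still dropped, which matches the output above. Adding a twelfth name to the tuple (any label of your choosing) would keep it here, and the same idea should apply to the `fields` tuple passed to `parse_incr`.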