I'm trying to parse .conll files from this GitHub repo. Here is an example of my parsing code:
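from io import open
from conllu import parse_incr  # the loop below calls parse_incr, not parse_tree_incr
import glob

for filename in glob.glob('./licenses-conll-format/22-MIT/MIT_permissionCopy.conll'):
    # Open each matching file as UTF-8 and close it when done
    with open(filename, "r", encoding="utf-8") as data_file:
        # parse_incr yields one sentence at a time
        for tokentree in parse_incr(data_file):
            print(tokentree.serialize())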
Output:
24 Permission _ NN NN _ 27 nsubjpass _ _
25 is _ VBZ VBZ _ 27 auxpass _ _
26 hereby _ RB RB _ 27 advmod _ _
27 granted _ VBN VBN _ 11 rcmod _ _
28 , _ , , _ 27 punct _ _
29 free _ JJ JJ _ 27 advmod _ _
30 of _ IN IN _ 0 erased _ _
31 charge _ NN NN _ 29 prep_of _ _
This seems to be missing some annotations (I-PERMISSION, B-PERMISSION, etc.) that are present in the original .conll file:
24 Permission _ NN NN _ 27 nsubjpass _ _ B-PERMISSION COPY
25 is _ VBZ VBZ _ 27 auxpass _ _ I-PERMISSION
26 hereby _ RB RB _ 27 advmod _ _ I-PERMISSION
27 granted _ VBN VBN _ 11 rcmod _ _ I-PERMISSION
28 , _ , , _ 27 punct _ _ O
29 free _ JJ JJ _ 27 advmod _ _ I-PERMISSION
30 of _ IN IN _ 0 erased _ _ I-PERMISSION
31 charge _ NN NN _ 29 prep_of _ _ I-PERMISSION
32 , _ , , _ 27 punct _ _ O
Any thoughts on how to get all the annotations?
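By default the parser only reads the ten standard CoNLL-U columns, so anything after misc is dropped. You can specify the tuple of fields yourself, adding a name for the extra column (here 'rest'):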
fields = ('id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc', 'rest')
for tokentree in parse_incr(data_file, fields=fields):
    print(tokentree.serialize())
Output:
24 Permission _ NN NN _ 27 nsubjpass _ _ B-PERMISSION
25 is _ VBZ VBZ _ 27 auxpass _ _ I-PERMISSION
26 hereby _ RB RB _ 27 advmod _ _ I-PERMISSION
27 granted _ VBN VBN _ 11 rcmod _ _ I-PERMISSION
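
Note that token 24 in the original file carries a second extra value (COPY) that does not appear in the output above. If you need every trailing column, one option is to split the lines yourself. The following is only a minimal sketch, not part of conllu: read_rows and CONLL_FIELDS are hypothetical names, and it assumes columns are separated by whitespace and that no column value contains whitespace (check this against the real files, which may be tab-separated).

```python
# Hypothetical helper, not part of the conllu library.
# The ten standard column names, as used in the fields tuple above.
CONLL_FIELDS = ('id', 'form', 'lemma', 'upostag', 'xpostag',
                'feats', 'head', 'deprel', 'deps', 'misc')


def read_rows(lines, fields=CONLL_FIELDS, extra_key='rest'):
    """Yield one dict per token line; surplus columns go into a list under extra_key."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank sentence separators and comment lines
        # Assumption: whitespace-separated columns, no whitespace inside a value
        columns = line.split()
        row = dict(zip(fields, columns))
        # Everything past the standard columns is kept, however many there are
        row[extra_key] = columns[len(fields):]
        yield row


# Sample lines copied from the question
sample = [
    "24 Permission _ NN NN _ 27 nsubjpass _ _ B-PERMISSION COPY",
    "25 is _ VBZ VBZ _ 27 auxpass _ _ I-PERMISSION",
    "28 , _ , , _ 27 punct _ _ O",
]

rows = list(read_rows(sample))
for row in rows:
    print(row['id'], row['form'], row['rest'])
```

To use it on a file, pass the open file object instead of the sample list, e.g. read_rows(data_file). Unlike the conllu parser, this sketch does not group tokens into sentences or convert id and head to integers.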