I'm trying to create a CoNLL-U file using the conllu library as part of a Universal Dependency tagging project I'm working on.
I have a number of sentences stored as Python lists. Each sentence contains sub-lists of tokens, lemmata, POS tags, features, etc. For example:
sentence = [['The', 'the', 'DET', ... ], ['big', 'big', 'ADJ', ... ], ['dog', 'dog', 'NOUN', ...], ...]
I want to automate the process of turning these into CoNLL-U parsed sentences, so I wrote the following function:
from collections import OrderedDict

def compile_sent(sent):
    sent_list = list()
    for i, tok_data in enumerate(sent):
        tok_id = i + 1
        tok = tok_data[0]
        lemma = tok_data[1]
        pos = tok_data[2]
        feats = tok_data[3]
        compiled_tok = OrderedDict({'id': tok_id, 'form': tok, 'lemma': lemma, 'upostag': pos, 'xpostag': None, 'feats': feats, 'head': None, 'deprel': None, 'deps': None, 'misc': None})
        sent_list.append(compiled_tok)
    sent_list = sent_list.serialize()
    return sent_list

print(compile_sent(sentence))
When I try to run this code I get the following error:
Traceback (most recent call last):
  File "/Users/me/PycharmProjects/UDParser/Rough_Work.py", line 103, in <module>
    print(compile_sent(sentence))
  File "/Users/me/PycharmProjects/UDParser/Rough_Work.py", line 99, in compile_sent
    sent_list = sent_list.serialize()
AttributeError: 'list' object has no attribute 'serialize'
The problem is that I'm creating a normal list and calling the serialize() method on it. I don't know how to create the kind of TokenList that the library produces when the parse() function is run on a string in the CoNLL-U file format.
When you try to print that type of list you get the following output:
from conllu import parse

data = """
# text = The big dog
1 The the DET _ Definite=Def|PronType=Art _ _ _ _
2 big big ADJ _ Degree=Pos _ _ _ _
3 dog dog NOUN _ Number=Sing _ _ _ _
"""
sentences = parse(data)
sentence = sentences[0]
print(sentence)
TokenList<The, big, dog>
Running the serialize() method on this type of list will turn it back into a CoNLL-U format string like data in the example above. However, it breaks when you try to run it on a normal Python list.
How can I create a TokenList like this instead of a normal Python list object?
Change your sent_list from a normal list to a TokenList.
from conllu import TokenList
from collections import OrderedDict

def compile_sent(sent):
    sent_list = TokenList()
    # ... etc ...
You can view the methods available on TokenList by running help(TokenList) in a REPL.