I'm trying to convert conllu files to spaCy's jsonl format. These conllu files contain paragraph information as specified on the Universal Dependencies website. The problem is that the paragraph information is not carried over to the converted jsonl file, where each paragraph contains a single sentence.
I'm running spaCy version 2.1.3 and using only the required arguments of the spacy convert command, basically python -m spacy convert input.conllu output_dir
Here are the first few sentences from one of my conllu files (maybe they are not to specification?). For the sake of readability, I'm only pasting the first few tokens of each sentence.
# sent_id = tp2-p1-s1
# O cansaço começou a afetar os vestibulandos no terceiro dia de exame da Fuvest.
1 O O DET DET gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 2 DET _ _
2 cansaço cansaço NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 5 NSUBJ _ _
3 começou começar VERB VERB aspect=PERFECTIVE|mood=INDICATIVE|number=SINGULAR|person=THIRD|proper=NOT_PROPER|tense=PAST 5 AUX _ _
# sent_id = tp2-p1-s2
# "Estou meio cheia, mesmo", afirmou a candidata a filosofia Scyla Pereira Gouveia, 19, que fez as provas de biologia e química, de ontem, no colégio Pueri Domus.
1 " " PUNCT PUNCT proper=NOT_PROPER 2 P _ _
2 Estou Estar VERB VERB aspect=IMPERFECTIVE|mood=INDICATIVE|number=SINGULAR|person=FIRST|proper=NOT_PROPER|tense=PRESENT 0 ROOT _ _
3 meio meio NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 2 DOBJ _ _
4 cheia cheio ADJ ADJ gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 3 AMOD _ _
# sent_id = tp2-p1-s3
# Seu namorado, Guilherme Schneider, 18, que presta engenharia, faz exame no mesmo local.
1 Seu Seu PRON PRON gender=MASCULINE|number=SINGULAR|person=THIRD|proper=NOT_PROPER 2 DET _ _
2 namorado namorado NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 13 NSUBJ _ _
# newpar id = tp2-p2
# sent_id = tp2-p2-s1
# Pelo menos um dos 38.454 convocados para a segunda fase da Fuvest tem fortes motivos para não concluir hoje as provas.
1 Pelo Pelo ADP ADP gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 3 ADVMOD _ _
2 menos menos NOUN NOUN gender=MASCULINE|number=SINGULAR|proper=NOT_PROPER 1 MWE _ _
3 um um NUM NUM gender=MASCULINE|proper=NOT_PROPER 13 NSUBJ _ _
I expected the output of convert to be one file containing 2 lines, one for each paragraph. I'm getting 4 lines, one for each sentence.
I would really like to avoid building a converter of my own, if at all possible.
Thanks in advance
As it turns out, spaCy's format is prepared to hold paragraph information, but, as of the writing of this answer, that information is unused.
For now, when training models that are supposed to learn sentence segmentation, it's necessary to pass the --n-sents option to the converter, so that each converted document contains more than one sentence.
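If the paragraph grouping is still needed before conversion, it can be recovered from the comment lines directly. The following is only a minimal sketch, not part of spaCy: the function name group_sentences_by_paragraph is hypothetical, and it assumes that every paragraph after the first is introduced by a "# newpar" comment and that every sentence starts with a "# sent_id" comment or follows a blank line, as in the sample from the question. It returns plain lists of sentence blocks and does not write spaCy's training format.

```python
from typing import List


def group_sentences_by_paragraph(conllu_text: str) -> List[List[str]]:
    """Group CoNLL-U sentences into paragraphs using '# newpar' comments.

    Hypothetical helper: each paragraph is a list of sentences, and each
    sentence is the newline-joined token lines of that sentence.
    """
    paragraphs: List[List[str]] = []
    current_par: List[str] = []
    current_sent: List[str] = []

    def flush_sentence() -> None:
        # Close the sentence being collected, if it has any token lines.
        if current_sent:
            current_par.append("\n".join(current_sent))
            current_sent.clear()

    def flush_paragraph() -> None:
        # Close the paragraph being collected, if it has any sentences.
        flush_sentence()
        if current_par:
            paragraphs.append(list(current_par))
            current_par.clear()

    for raw_line in conllu_text.splitlines():
        line = raw_line.strip()
        if line.startswith("# newpar"):
            # A new paragraph begins here (assumption: UD-style comment).
            flush_paragraph()
        elif not line or line.startswith("# sent_id"):
            # A blank line or a new sent_id ends the previous sentence.
            flush_sentence()
        elif not line.startswith("#"):
            # Any other non-comment line is treated as a token line.
            current_sent.append(line)

    flush_paragraph()
    return paragraphs
```

Applied to the sample in the question, this should yield two paragraphs, the first with three sentences and the second with one. Turning those groups into spaCy's format would still require a custom step, so for plain training the --n-sents option remains the simpler route.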