The goal is to train BERT SRL on another data set. According to the configuration, it requires conll-formatted-ontonotes-5.0.
Natively, my data comes in a CoNLL format, and I converted it to the format of the conll-formatted-ontonotes-5.0 edition of OntoNotes v.5.0 on GitHub. Reading the data works and training seems to work, except that precision remains at 0. I suspect that either the encoding of SRL arguments (BIO or phrasal?) or the column structure (other OntoNotes editions in CoNLL format differ here) deviates from the expected input. Alternatively, the error may arise because the role labels are hard-wired in the code. I followed the reference data in using the long form (ARGM-TMP), but you often see the short form (AM-TMP) in other data.
The question is which dataset and format is expected here. I guess it's one of the CoNLL/Skel formats for OntoNotes 5.0 with a restored WORD column, but:
The CoNLL edition doesn't seem to be shipped with the LDC edition of OntoNotes.
It does not seem to be the format of the "conll-formatted-ontonotes-5.0" edition of OntoNotes v.5.0 on GitHub provided by the OntoNotes creators.
There is at least one other CoNLL/Skel edition of OntoNotes 5.0 data as part of PropBank. This differs from the other one in leaving out 3 columns and in the encoding of predicates. (For parts of my data, this is the native format.)
The SrlReader documentation mentions BIO (IOBES) encoding. This has indeed been used in other CoNLL editions of PropBank data, but not in the above-mentioned OntoNotes corpora. Other such formats are, for example, the CoNLL-2008 and CoNLL-2009 formats and variants of them.
Before I start reverse-engineering the SrlReader, does anyone have a data snippet at hand so that I can prepare my data accordingly?
conll-formatted-ontonotes-5.0 version of my data (sample from EWT corpus):
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 0 where WRB (TOP(S(SBARQ(WHADVP*) - - - - * (ARGM-LOC*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 1 can MD (SQ* - - - - * (ARGM-MOD*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 2 I PRP (NP*) - - - - * (ARG0*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 3 get VB (VP* get 01 - - * (V*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 4 morcillas NNS (NP*) - - - - * (ARG1*) * * -
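To make the "BIO or phrasal" question concrete, here is a minimal sketch of how the bracketed (phrasal) argument column in the sample above could be mapped to BIO tags. It is only an illustration, not the SrlReader's actual code. The column indices (word at index 3, first argument column at index 11) are assumptions read off the sample, and the helper name `brackets_to_bio` is made up for this example.

```python
from typing import List

# Assumed 0-based column positions, inferred from the sample rows only.
WORD_COL = 3
FIRST_ARG_COL = 11


def brackets_to_bio(column: List[str]) -> List[str]:
    """Convert one bracketed argument column into BIO tags.

    Example input: ["(ARG0*", "*)", "(V*)"]
    Assumes that spans within a single column do not nest.
    """
    tags = []
    open_label = None  # label of the span we are currently inside, if any
    for cell in column:
        label = cell.strip("()*")
        if "(" in cell:
            # A span starts on this token.
            tags.append("B-" + label)
            open_label = label
        elif open_label is not None:
            # No opening bracket, but a span is still open.
            tags.append("I-" + open_label)
        else:
            tags.append("O")
        if ")" in cell:
            # The span ends on this token.
            open_label = None
    return tags


sample = """\
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 0 where WRB (TOP(S(SBARQ(WHADVP*) - - - - * (ARGM-LOC*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 1 can MD (SQ* - - - - * (ARGM-MOD*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 2 I PRP (NP*) - - - - * (ARG0*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 3 get VB (VP* get 01 - - * (V*) * * -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 4 morcillas NNS (NP*) - - - - * (ARG1*) * * -
"""

rows = [line.split() for line in sample.splitlines() if line.strip()]
words = [row[WORD_COL] for row in rows]
bio = brackets_to_bio([row[FIRST_ARG_COL] for row in rows])

for word, tag in zip(words, bio):
    print(word, tag)
```

Every argument in this sample spans a single token, so each one becomes a `B-` tag. A multi-token span such as `(ARG0*`, `*`, `*)` would yield `B-ARG0`, `I-ARG0`, `I-ARG0`.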
The "native" format is that of the CoNLL-2012 edition; see cemantix.org/conll/2012/data.html for how to create it.
The Ontonotes class that reads it may, however, encounter difficulties when parsing "native" CoNLL-2012 data, because the CoNLL-2012 preprocessing scripts can produce invalid parse trees. Parsing these with NLTK then raises a ValueError such as
ValueError: Tree.read(): expected ')' but got 'end-of-string'
at index 1427.
"...LT#.#.) ))"
There is no direct way to solve that at the data level, because the string that is parsed is an intermediate representation, not the original data. If you want to process CoNLL-2012 data, the ValueError has to be caught; cf. https://github.com/allenai/allennlp/issues/5410.
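A minimal sketch of what "catching the ValueError" could look like follows. Where exactly to catch it inside the reader is not specified here, so the wrapper is generic. `parse_or_none` and `strict_bracket_parse` are hypothetical names, and the latter is only a stand-in for a real tree parser so that the example runs without NLTK.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_or_none(parse: Callable[[str], T], text: str) -> Optional[T]:
    """Return parse(text), or None if the parser raises ValueError."""
    try:
        return parse(text)
    except ValueError:
        return None


def strict_bracket_parse(text: str) -> str:
    """Stand-in parser: raises ValueError on unbalanced parentheses."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unexpected ')'")
    if depth != 0:
        raise ValueError("expected ')' but got 'end-of-string'")
    return text


sentences = [
    "(TOP (S (NP I) (VP run)))",  # balanced
    "(TOP (S (NP I) (VP run)",    # truncated, like the failing case
]
parsed = [parse_or_none(strict_bracket_parse, s) for s in sentences]
kept = [p for p in parsed if p is not None]
print(len(kept), "of", len(sentences), "trees parsed")
```

With NLTK installed, `nltk.Tree.fromstring` could be passed as `parse` in place of the stand-in. Sentences whose tree fails to parse then come back as `None` and can be skipped or kept without a parse tree, depending on what the downstream code needs.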