Tags: allennlp, conll, srl

AllenNLP BERT SRL input format ("OntoNotes v. 5.0 formatted")


The goal is to train BERT SRL on another data set. According to the configuration, it requires data in the conll-formatted-ontonotes-5.0 format.

Natively, my data comes in a CoNLL format, and I converted it to the conll-formatted-ontonotes-5.0 format of the GitHub edition of OntoNotes v.5.0. Reading the data works and training seems to work, except that precision remains at 0. I suspect that either the encoding of SRL arguments (BIO or phrasal?) or the column structure (other OntoNotes editions in CoNLL format differ here) differs from the expected input. Alternatively, the error may arise because the role labels are hard-wired in the code. I followed the reference data in using the long form (ARGM-TMP), but the short form (AM-TMP) is common in other data.

The question is which dataset and format is expected here. I guess it's one of the CoNLL/Skel formats for OntoNotes 5.0 with a restored WORD column, but I am not sure which.

Before I start reverse-engineering the SrlReader, does anyone have a data snippet at hand so that I can prepare my data accordingly?

conll-formatted-ontonotes-5.0 version of my data (sample from EWT corpus):

google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   0   where   WRB (TOP(S(SBARQ(WHADVP*)   -   -   -   -   *   (ARGM-LOC*) *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   1   can MD  (SQ*    -   -   -   -   *   (ARGM-MOD*) *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   2   I   PRP (NP*)   -   -   -   -   *   (ARG0*) *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   3   get VB  (VP*    get 01  -   -   *   (V*)    *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   4   morcillas   NNS (NP*)   -   -   -   -   *   (ARG1*) *   *   -
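To make the "BIO or phrasal" question concrete: the sample above uses the phrasal (bracketed) encoding, where a span opens with "(LABEL*" and closes with "*)". Below is a minimal sketch of how such a column maps to BIO tags. The function name spans_to_bio is hypothetical and not part of AllenNLP, and the sketch assumes flat, non-nested spans and that the first predicate-argument column is whitespace-separated field 11 (counting from 0), as it is in the sample rows.

```python
from typing import List


def spans_to_bio(column: List[str]) -> List[str]:
    """Convert one bracketed predicate-argument column to BIO tags.

    Assumption: spans are flat (no nesting such as "(ARG1(V*)").
    """
    tags: List[str] = []
    current = None  # label of the span that is currently open, if any
    for cell in column:
        if cell.startswith("("):
            # "(ARG0*" or "(ARG0*)": the label sits between "(" and "*".
            label = cell[1:cell.index("*")]
            tags.append("B-" + label)
            current = None if cell.endswith(")") else label
        elif current is not None:
            # Inside an open span; "*)" closes it after this token.
            tags.append("I-" + current)
            if cell.endswith(")"):
                current = None
        else:
            tags.append("O")
    return tags


def argument_column(rows: List[str], index: int = 11) -> List[str]:
    """Pick one whitespace-separated field from each CoNLL row."""
    return [row.split()[index] for row in rows]
```

For the five sample rows, argument_column yields (ARGM-LOC*), (ARGM-MOD*), (ARG0*), (V*), (ARG1*), which spans_to_bio turns into one B- tag per token because every span is a single token long.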

Solution

  • The "native" format is that of the CoNLL-2012 edition; see cemantix.org/conll/2012/data.html for how to create it.

    The Ontonotes class that reads it may, however, encounter difficulties when parsing "native" CoNLL-2012 data, because the CoNLL-2012 preprocessing scripts can produce invalid parse trees. Parsing such a tree with NLTK raises a ValueError such as

    ValueError: Tree.read(): expected ')' but got 'end-of-string'
                at index 1427.
                    "...LT#.#.) ))"
    

    There is no direct way to solve that at the data level, because the string that is parsed is an intermediate representation, not the original data. If you want to process CoNLL-2012 data, the ValueError has to be caught, cf. https://github.com/allenai/allennlp/issues/5410.
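
    Catching the ValueError could look roughly like the sketch below. It is a generic wrapper rather than AllenNLP code: read_skipping_bad_files and its read_file parameter are hypothetical names, and read_file stands for whatever callable reads one file and returns its sentences (for example, a function that wraps the Ontonotes class). The sketch assumes that dropping a whole file when one of its trees is broken is acceptable.

```python
import logging
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def read_skipping_bad_files(
    paths: Iterable[str],
    read_file: Callable[[str], Iterable[T]],
) -> Iterator[T]:
    """Yield sentences from every file that can be read without a ValueError."""
    for path in paths:
        try:
            # Read the whole file first, so that an error raised midway
            # does not leave a half-read file in the output.
            sentences: List[T] = list(read_file(path))
        except ValueError as err:
            logger.warning("Skipping %s: %s", path, err)
            continue
        yield from sentences
```

    Skipping per file is the coarsest option. Skipping only the offending sentence would require catching the error inside the reader itself, which is what the linked issue discusses.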