nlp stanford-nlp huggingface-transformers named-entity-recognition

Converting a dataset to CoNLL format. Label remaining tokens with O

I have a manually annotated dataset that contains records in the following format:

{
    "id": 1,
    "text": "At the end of each fiscal quarter, for the four consecutive fiscal quarters ending as of such fiscal quarter end, from the date of the Third Amendment and until December 30, 1996, the Company shall maintain a fixed charge coverage ratio of not less than 1.25 to 1.0.",
    "label": [
        [
            209,
            230,
            "COV_3"
        ],
        [
            379,
            390,
            "VAL_3"
        ]
    ],
}

In the example above, "label" holds the custom entities in my dataset. The phrase fixed charge coverage is located at position [209, 230] and is given the label COV_3. Likewise, the phrase 1.25 to 1.0 is located at [379, 390] and is given the label VAL_3.

Now, I would like to fine-tune a transformer model such as BERT on this dataset. However, I realised that the dataset must be in CoNLL format, or at least that every token of each data point must be labelled. Is there an easy way to label the remaining tokens with "O", or to transform this dataset into CoNLL format?

Solution

You can use spaCy to tokenize the text and convert the character-offset annotations to IOB tags with its built-in utility methods. Note that this will skip any spans that don't align to token boundaries, so you may need to customize the tokenizer or provide the tokenization from another source when creating a Doc.
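To make the alignment caveat concrete, here is a minimal dependency-free sketch. It does not use spaCy: it assumes plain whitespace tokenization as a stand-in for a real tokenizer, and the helper names `token_offsets` and `aligns` are hypothetical. It only illustrates what it means for a character span to align with token boundaries.

```python
import re


def token_offsets(text):
    # Whitespace tokenization only; an assumption standing in for a real tokenizer.
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def aligns(offsets, start, end):
    # A span aligns when it starts at some token start and ends at some token end.
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


text = "ratio of not less than 1.25 to 1.0."
offsets = token_offsets(text)

print(aligns(offsets, 0, 5))    # "ratio": boundaries match on both sides
print(aligns(offsets, 23, 34))  # "1.25 to 1.0": ends inside the token "1.0."
```

With this crude tokenizer the trailing period stays attached to "1.0.", so the second span would be skipped. A tokenizer that splits off the final period avoids that, which is why the choice of tokenization matters here.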

The character offsets in the question don't all line up with the text, so they are corrected in the code below.
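A quick way to spot such problems before converting anything is to slice the text with each annotated range and look at what comes back. The following is a small sketch using only plain Python string operations; `check_offsets` is a hypothetical helper name, and the labels are the ones from the question's record.

```python
text = (
    "At the end of each fiscal quarter, for the four consecutive fiscal quarters "
    "ending as of such fiscal quarter end, from the date of the Third Amendment "
    "and until December 30, 1996, the Company shall maintain a fixed charge "
    "coverage ratio of not less than 1.25 to 1.0."
)
labels = [[209, 230, "COV_3"], [379, 390, "VAL_3"]]


def check_offsets(text, labels):
    # Pair every annotation with the substring its offsets actually cover.
    return [(start, end, label, text[start:end]) for start, end, label in labels]


for start, end, label, covered in check_offsets(text, labels):
    print(start, end, label, repr(covered))

# If the intended phrase is known, str.find can recover its real position.
phrase = "1.25 to 1.0"
start = text.find(phrase)
print(start, start + len(phrase))
```

An empty or unexpected substring means the offsets were probably computed against a different (for example longer) version of the text.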

# tested with spacy v3.4.3, should work with spacy v3.x
import spacy
from spacy.training.iob_utils import biluo_to_iob, doc_to_biluo_tags

data = {
    "id": 1,
    "text": "At the end of each fiscal quarter, for the four consecutive fiscal quarters ending as of such fiscal quarter end, from the date of the Third Amendment and until December 30, 1996, the Company shall maintain a fixed charge coverage ratio of not less than 1.25 to 1.0.",
    "label": [[209, 230, "COV_3"], [254, 265, "VAL_3"]],
}

nlp = spacy.blank("en")

# tokenize the text to create a doc
doc = nlp(data["text"])

# convert annotation to entity spans and add them to the doc
ents = []
for start, end, label in data["label"]:
    span = doc.char_span(start, end, label=label)
    if span is not None:
        ents.append(span)
    else:
        print(
            "Skipping span (does not align to tokens):",
            start,
            end,
            label,
            doc.text[start:end],
        )
doc.ents = ents

# convert doc annotation to IOB tags
for token, iob_tag in zip(doc, biluo_to_iob(doc_to_biluo_tags(doc))):
    print(token.text + " " + iob_tag)

Output:

At O
the O
end O
of O
each O
fiscal O
quarter O
, O
for O
the O
four O
consecutive O
fiscal O
quarters O
ending O
as O
of O
such O
fiscal O
quarter O
end O
, O
from O
the O
date O
of O
the O
Third O
Amendment O
and O
until O
December O
30 O
, O
1996 O
, O
the O
Company O
shall O
maintain O
a O
fixed B-COV_3
charge I-COV_3
coverage I-COV_3
ratio O
of O
not O
less O
than O
1.25 B-VAL_3
to I-VAL_3
1.0 I-VAL_3
. O

These are the 1st and 4th columns of the 4-column CoNLL 2003 format. You may want to insert blank lines at sentence boundaries or add the special document boundary lines, and other tools may require real or placeholder values in the 2nd and 3rd columns (the POS tag and chunk tag).
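As a rough sketch of how the full 4-column layout could be written out, the function below takes already tagged sentences and fills the two middle columns with a placeholder. The name `to_conll`, the `-X-` placeholder and the exact form of the `-DOCSTART-` line are assumptions made for illustration; check what your downstream tool expects, since some tools need real POS and chunk tags.

```python
def to_conll(sentences, placeholder="-X-", docstart=True):
    """Format sentences (lists of (token, ner_tag) pairs) as 4-column lines."""
    lines = []
    if docstart:
        # Document boundary line; its exact form is an assumption here.
        lines.append(f"-DOCSTART- {placeholder} {placeholder} O")
        lines.append("")
    for sentence in sentences:
        for token, tag in sentence:
            # Columns: token, POS placeholder, chunk placeholder, NER tag.
            lines.append(f"{token} {placeholder} {placeholder} {tag}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines)


sentences = [
    [("than", "O"), ("1.25", "B-VAL_3"), ("to", "I-VAL_3"), ("1.0", "I-VAL_3"), (".", "O")],
]
print(to_conll(sentences))
```

The (token, tag) pairs can come straight from the loop that prints the IOB tags, collected per sentence instead of printed.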