I have a JSON file that will be used as training data for an NER model. Each object holds a sentence and the entities that occur in that sentence. I want to write a function that generates a BIO-labeled string for each sentence from its entities.
For example, take the following object from the JSON file:
{
    "request": "I want to fly to New York on the 13.3",
    "entities": [
        {"start": 16, "end": 23, "text": "New York", "category": "DESTINATION"},
        {"start": 32, "end": 35, "text": "13.3", "category": "DATE"}
    ]
}
For the sentence "I want to fly to New York on the 13.3", the corresponding BIO label is "O O O O O B-DESTINATION I-DESTINATION O O B-DATE", where B-category marks the beginning of an entity of that category, I-category marks a token inside it, and O marks a token outside any entity.
I'm looking for Python code that iterates over each object in the JSON file and generates a BIO label for it.
Feel free to change the JSON format if necessary.
This is just a quick implementation of the task; many optimizations are possible and can be explored later, but at first glance here is the function:
def BIO_converter(r, entities):
    to_replace = {}  # maps each entity token to its BIO tag
    for i in entities:
        # In the sample data, 'start' sits one character before the entity's
        # 0-based position and 'end' is inclusive, hence the +1 / +2.
        sub = r[i['start']+1:i['end']+2].split(' ')
        # First token gets B-, every following token gets I-
        vals = [f"B-{i['category']}"] + [f"I-{i['category']}"] * (len(sub)-1)
        to_replace.update(zip(sub, vals))  # also works before Python 3.9
    r = r.split(' ')
    # Tokens are matched by text, so a word that also occurs outside
    # an entity receives the same tag there.
    r = [to_replace.get(tok, 'O') for tok in r]
    return ' '.join(r)
js = {
    "request": "I want to fly to New York on the 13.3",
    "entities": [
        {"start": 16, "end": 23, "text": "New York", "category": "DESTINATION"},
        {"start": 32, "end": 35, "text": "13.3", "category": "DATE"}
    ]
}
print(BIO_converter(js['request'], js['entities']))
Should output:
O O O O O B-DESTINATION I-DESTINATION O O B-DATE
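One caveat: the function above matches tokens by their text, so a word such as "to" would be tagged wherever it occurs if it also appeared inside an entity. A minimal sketch that matches by character position instead could look like the following. The names bio_tags, bio_tags_for_file, start_shift and end_shift are hypothetical, the default shifts are inferred from the sample offsets only, and the file is assumed to hold a JSON list of objects shaped like the example.

```python
import json
import re


def bio_tags(request, entities, start_shift=1, end_shift=2):
    """Return a space-separated BIO string, matching tokens by position.

    Assumption: request[start + start_shift : end + end_shift] is the
    entity text, which is what the sample data suggests.
    """
    # Convert every entity to a 0-based, end-exclusive character span.
    spans = [
        (e["start"] + start_shift, e["end"] + end_shift, e["category"])
        for e in entities
    ]
    tags = []
    # Walk over whitespace-separated tokens together with their offsets.
    for match in re.finditer(r"\S+", request):
        tag = "O"
        for span_start, span_end, category in spans:
            # A token belongs to an entity if it lies fully inside its span.
            if span_start <= match.start() and match.end() <= span_end:
                prefix = "B" if match.start() == span_start else "I"
                tag = f"{prefix}-{category}"
                break
        tags.append(tag)
    return " ".join(tags)


def bio_tags_for_file(path):
    """Yield one BIO string per object (assumes a JSON list in the file)."""
    with open(path, encoding="utf-8") as handle:
        for obj in json.load(handle):
            yield bio_tags(obj["request"], obj["entities"])


sample = {
    "request": "I want to fly to New York on the 13.3",
    "entities": [
        {"start": 16, "end": 23, "text": "New York", "category": "DESTINATION"},
        {"start": 32, "end": 35, "text": "13.3", "category": "DATE"},
    ],
}
print(bio_tags(sample["request"], sample["entities"]))
# O O O O O B-DESTINATION I-DESTINATION O O B-DATE
```

If the real data uses ordinary 0-based, end-exclusive offsets, passing start_shift=0 and end_shift=0 should be enough. Tokens with attached punctuation (for example "York,") would need a proper tokenizer rather than a whitespace split.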