I have been training an entity linker with spaCy, using a knowledge base of 6,000 entities from Wikidata.
The training data consists of 30,000 sentences.
I'm following the notebook provided by spaCy: https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/notebooks/notebook_video.ipynb
The training goes fine and the accuracy seems pretty good, until I test the model on a string that's clearly incorrect, such as "barack obama is a French born florist living in Spain with 36 cats and two hamsters". The model still predicts the person in this string as https://www.wikidata.org/wiki/Q76 (Barack Obama, the former US President).
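For context, this is roughly how I'm checking the predictions (a minimal sketch; the model path is a placeholder for wherever the trained pipeline is saved):

```python
import spacy

# Load the trained pipeline (path is a placeholder)
nlp = spacy.load("./my_nel_model")

doc = nlp("barack obama is a French born florist living in Spain with 36 cats and two hamsters")

# The entity linker writes its prediction to each entity span's kb_id_
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
# e.g. -> barack obama PERSON Q76
```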
I've tried adding additional parameters to the config, such as n_sents:
entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False, "n_sents": 6}, last=True)
Is there a way to improve this? It would be better to return NIL instead of a wrong answer. Or is there a confidence score that can be output?
The way the Entity Linker works is that, given all potential candidates for an entity, it picks the most likely one.
The issue you are running into is that your florist is not known to the model, so he is not a candidate. Because the only Barack Obama the model knows about is the former US President, the model can say with certainty that "Barack Obama" must refer to the president.
The model has no mechanism to tell whether a mention refers to an entity that is missing from the knowledge base. It will also never abstain: if there are candidates, it will pick one. "NIL" is not an abstention; it's returned when a mention has no entries in the knowledge base at all, so there's nothing to pick from.
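You can see this at the knowledge-base level with a toy example. A minimal sketch, assuming spaCy v3.5+ (where the in-memory KB class is InMemoryLookupKB; on older v3 releases it's spacy.kb.KnowledgeBase) and with made-up frequencies and entity vectors:

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")

# Toy KB with a single entity and a single alias
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q76", freq=100, entity_vector=[1.0, 0.0, 0.0])
kb.add_alias(alias="Barack Obama", entities=["Q76"], probabilities=[1.0])

# The only candidate for this alias is Q76, so the linker has
# nothing else it could pick, regardless of the sentence context
print([c.entity_ for c in kb.get_alias_candidates("Barack Obama")])
# -> ['Q76']

# An alias that was never added to the KB has no candidates at all;
# that is the only situation in which the linker outputs NIL
print(kb.get_alias_candidates("some unknown florist"))
# -> []
```

So the only way your florist sentence would come back as NIL is if the mention itself had no entry in the KB's aliases; as long as "Barack Obama" resolves to a candidate list containing Q76, the model will link to it.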
This may be clearer if you look at the example project, which uses the mention "Emerson". The model doesn't decide whether "Emerson" is a person it knows or not - it assumes it must be one of the people it knows, and only picks which of them is most likely.