Perhaps I've skipped over a part of the docs, but what I am trying to get is a unique ID for each entity returned by the standard NER pipeline. For example:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()
text = "This is a text about Apple Inc based in San Fransisco. "\
"And here is some text about Samsung Corp. "\
"Now, here is some more text about Apple and its products for customers in Norway"
doc = nlp(text)
for ent in doc.ents:
    print('ID:{}\t{}\t"{}"'.format(ent.label, ent.label_, ent.text))
displacy.render(doc, jupyter=True, style='ent')
returns:
ID:381 ORG "Apple Inc"
ID:382 GPE "San Fransisco"
ID:381 ORG "Samsung Corp."
ID:381 ORG "Apple"
ID:382 GPE "Norway"
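As far as I can tell, that number is just the ID of the label string in spaCy's StringStore, so it identifies the entity type (every ORG gets 381, every GPE gets 382), not the entity itself. A quick check (the exact numbers depend on the spaCy version; newer versions return 64-bit hashes):
# ent.label is the ID of the label string, not an ID for the entity mention itself.
# Depending on the spaCy version this is a small index or a 64-bit hash.
print(nlp.vocab.strings['ORG'], nlp.vocab.strings['GPE'])  # e.g. 381 382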
I have been looking at ent.ent_id and ent.ent_id_, but these are inactive according to the docs, and I couldn't find anything in ent.root either.
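To illustrate what I'm after, here is a naive sketch (my own workaround, not a spaCy feature) that numbers entities by their exact surface text; it still treats "Apple Inc" and "Apple" as different entities, which is the limitation I'd like to get around:
# Naive sketch: assign a running ID per distinct entity string.
# "Apple Inc" and "Apple" still get different IDs, so this is not real entity linking.
seen = {}
for ent in doc.ents:
    ent_id = seen.setdefault(ent.text, len(seen) + 1)
    print('ID:{}\t{}\t"{}"'.format(ent_id, ent.label_, ent.text))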
For example, in GCP NLP each entity mention is returned with a number (shown below as ⟨entity⟩number) that enables you to identify multiple instances of the same entity within a text:
This is a ⟨text⟩2 about ⟨Apple Inc⟩1 based in ⟨San Fransisco⟩4. And here is some ⟨text⟩3 about ⟨Samsung Corp⟩6. Now, here is some more ⟨text⟩8 about ⟨Apple⟩1 and its ⟨products⟩5 for ⟨customers⟩7 in ⟨Norway⟩9
Does spaCy support something similar? Or is there a way to do this with NLTK or Stanford NLP?
You can use the neuralcoref library to get coreference resolution working with spaCy's models:
# Load your usual SpaCy model (one of SpaCy English models)
import spacy
nlp = spacy.load('en')
# Add neural coref to SpaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)
# You're done. You can now use NeuralCoref just as you would normally work with a spaCy Doc's annotations.
doc = nlp(u'My sister has a dog. She loves him.')
doc._.has_coref        # True if any coreference cluster was found
doc._.coref_clusters   # the clusters, e.g. [My sister: [My sister, She], a dog: [a dog, him]]
Find the installation and usage instructions here: https://github.com/huggingface/neuralcoref
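Building on that, one way to get numbering similar to the GCP output above is to give each named entity the index of the coreference cluster it overlaps with. This is only a sketch on top of neuralcoref's doc._.coref_clusters, and whether "Apple Inc" and "Apple" actually end up in the same cluster depends on the coreference model:
# Sketch: derive an "entity ID" from the coreference cluster each named entity falls into.
# Assumes the nlp pipeline above (spaCy model + neuralcoref) is already loaded.
doc = nlp("This is a text about Apple Inc based in San Fransisco. "
          "And here is some text about Samsung Corp. "
          "Now, here is some more text about Apple and its products for customers in Norway")

entity_ids = {}
for cluster_id, cluster in enumerate(doc._.coref_clusters):
    for mention in cluster.mentions:
        for ent in doc.ents:
            # A mention and an entity refer to the same span if their token ranges overlap.
            if ent.start < mention.end and mention.start < ent.end:
                entity_ids[(ent.start, ent.end)] = cluster_id

for ent in doc.ents:
    # Entities that are not part of any coreference cluster get no ID here.
    print(entity_ids.get((ent.start, ent.end), '-'), ent.label_, '"{}"'.format(ent.text))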