I'm currently working on a project using spaCy with the German trained pipeline de_dep_news_trf.
Unfortunately, I'm having issues with named entity recognition (NER).
When I run a simple sentence like "Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin.", no entities are detected.
I've followed these steps to set up my Python environment (Python 3.12 on Windows) in a PyCharm Community project:
python.exe -m pip install --upgrade pip
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download de_dep_news_trf --timeout 600
pip install spacy[transformers]
Here is a snippet of my code:
import spacy


def process_text_with_spacy(text_to_process):
    doc = nlp(text_to_process)
    data = {
        "text": text_to_process,
        "sentences": []
    }
    for sent in doc.sents:
        process_sentence_data = {
            "sentence": sent.text,
            "entities": []
        }
        for ent in sent.ents:
            process_sentence_data["entities"].append({
                "text": ent.text,
                "start": ent.start_char,
                "end": ent.end_char,
                "label": ent.label_
            })
        data["sentences"].append(process_sentence_data)
    return data


nlp = spacy.load('de_dep_news_trf')
sample_text = "Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin."
processed_data = process_text_with_spacy(sample_text)
print("Text:", sample_text)
for sentence_data in processed_data["sentences"]:
    print("Sentence:", sentence_data["sentence"])
    print("Entities:", sentence_data["entities"])
Output:
Text: Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin.
Sentence: Berlin ist die Hauptstadt von Deutschland.
Entities: []
Sentence: Angela Merkel war die Bundeskanzlerin.
Entities: []
When using de_core_news_lg, the output for each sentence is:
Text: Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin.
Sentence: Berlin ist die Hauptstadt von Deutschland.
Entities: [{'text': 'Berlin', 'start': 0, 'end': 6, 'label': 'LOC'}, {'text': 'Deutschland', 'start': 30, 'end': 41, 'label': 'LOC'}]
Sentence: Angela Merkel war die Bundeskanzlerin.
Entities: [{'text': 'Angela Merkel', 'start': 43, 'end': 56, 'label': 'PER'}]
However, when I use de_dep_news_trf, the results are empty.
I selected the model de_dep_news_trf based on the "accuracy" option on the spaCy website.
Could someone explain why de_dep_news_trf does not return the same result? Is there a specific reason or setting that could cause this difference?
Thank you for your help!
The problem is that this model has no component for recognizing entities.
See the documentation for de_dep_news_trf: its pipeline has the components transformer, tagger, morphologizer, parser, lemmatizer and attribute_ruler, but no ner (the EntityRecognizer).
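You can also check this yourself before relying on a model. Below is a minimal sketch, not an official spaCy recipe: `missing_components` and `check_model` are hypothetical helper names I made up for illustration, while `spacy.load` and `nlp.pipe_names` are the real spaCy calls. The component list in the example is the one quoted above from the documentation.

```python
def missing_components(pipe_names, required=("ner",)):
    """Return the required component names that are absent from a pipeline."""
    return [name for name in required if name not in pipe_names]


def check_model(model_name, required=("ner",)):
    """Load a spaCy pipeline and report which required components it lacks.

    Assumes the model has already been downloaded.
    """
    import spacy  # imported lazily so missing_components works without spaCy

    nlp = spacy.load(model_name)
    print(model_name, "components:", nlp.pipe_names)
    return missing_components(nlp.pipe_names, required)


# Component list for de_dep_news_trf as quoted from its documentation
trf_components = [
    "transformer", "tagger", "morphologizer",
    "parser", "lemmatizer", "attribute_ruler",
]
print(missing_components(trf_components))  # ['ner']
```

If the installed model matches the documented component list, `check_model("de_dep_news_trf")` should report `ner` as missing, which would explain why `doc.ents` and `sent.ents` stay empty.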
So you may need to use one of the other models: