I'm using spaCy with the pl_core_news_lg model to extract named entities from Polish text. It correctly detects both organizations (ORG) and people's names (PER):
import spacy
nlp = spacy.load("pl_core_news_lg")
text = "Jan Kowalski pracuje w IBM i współpracuje z Microsoft oraz Google."
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
Output:
[('Jan Kowalski', 'persName'), ('IBM', 'orgName'), ('Microsoft', 'orgName'), ('Google', 'orgName')]
However, when I use Presidio with the pl_core_news_lg model and a configuration file, the recognizers do not correctly detect organizations (ORG) or PESEL numbers, even though they appear in the list of supported entities.
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
provider = NlpEngineProvider(conf_file="path_to_my_file/nlp_config.yaml")
nlp_engine = provider.create_engine()
print(f"Supported recognizers (from NLP engine): {nlp_engine.get_supported_entities()}")
supported_languages = list(nlp_engine.get_supported_languages())
registry = RecognizerRegistry(supported_languages=["pl"])
registry.load_predefined_recognizers(["pl"])
print(f"Supported recognizers (from registry): {registry.get_supported_entities(['pl'])}")
analyzer = AnalyzerEngine(
    registry=registry, supported_languages=supported_languages, nlp_engine=nlp_engine
)
results = analyzer.analyze(text, "pl")
for entity in results:
    print(f"Found entity: {entity.entity_type} with score {entity.score}")
Output:
Supported recognizers (from NLP engine): ['ID', 'NRP', 'DATE_TIME', 'PERSON', 'LOCATION']
Supported recognizers (from registry): ['IN_VOTER', 'URL', 'IBAN_CODE', 'CREDIT_CARD', 'DATE_TIME', 'NRP', 'PHONE_NUMBER', 'MEDICAL_LICENSE', 'PERSON', 'IP_ADDRESS', 'ORGANIZATION', 'CRYPTO', 'LOCATION', 'PL_PESEL', 'EMAIL_ADDRESS']
Even though 'ORGANIZATION' and 'PL_PESEL' appear in the registry's list of supported entities, Presidio does not detect them in the text. I would also expect ORGANIZATION to appear in the list reported by the NLP engine, but it does not.
My config file:
nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg
ner_model_configuration:
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
    # orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID
Why does Presidio fail to detect organizations (ORG) and PESEL numbers (PL_PESEL), while spaCy correctly detects them?
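The configuration file is missing the 'labels_to_ignore' field under ner_model_configuration. When that field is omitted, Presidio falls back to its default ignore list, which appears to include the organization label (that would explain why ORGANIZATION is absent from the NLP engine's supported entities in your output). Setting the field explicitly to only 'O' tells the NLP engine not to ignore any real entity label: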
labels_to_ignore:
- O
In your configuration, it would look like this:
nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg
ner_model_configuration:
  labels_to_ignore:
    - O
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
    # orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID
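To see why the ignore list matters, here is a small pure-Python sketch of how a label mapping and an ignore list can interact. It is a simplified model for illustration only, not Presidio's actual implementation: the function filter_entities and the assumed_default list are hypothetical, and the assumption that the default ignore list contains ORGANIZATION is inferred from the output in the question rather than taken from Presidio's source.

```python
# Simplified, hypothetical model of label filtering; NOT Presidio's real code.

# Entities as reported by pl_core_news_lg for the sentence in the question.
SPACY_ENTITIES = [
    ("Jan Kowalski", "persName"),
    ("IBM", "orgName"),
    ("Microsoft", "orgName"),
    ("Google", "orgName"),
]

# Subset of model_to_presidio_entity_mapping from the question's config.
MAPPING = {
    "persName": "PERSON",
    "orgName": "ORGANIZATION",
    "placeName": "LOCATION",
}


def filter_entities(entities, mapping, labels_to_ignore):
    """Map model labels to Presidio names and drop any ignored label."""
    ignored = set(labels_to_ignore)
    kept = []
    for text, label in entities:
        presidio_label = mapping.get(label, label)
        # Skip the entity if either its raw or its mapped label is ignored.
        if label in ignored or presidio_label in ignored:
            continue
        kept.append((text, presidio_label))
    return kept


# Assumption: the fallback ignore list includes ORGANIZATION.
assumed_default = ["O", "ORGANIZATION"]

without_fix = filter_entities(SPACY_ENTITIES, MAPPING, assumed_default)
with_fix = filter_entities(SPACY_ENTITIES, MAPPING, ["O"])

print(without_fix)  # only the PERSON entity is kept
print(with_fix)     # the PERSON entity plus the three ORGANIZATION entities
```

Under that assumption, the organizations are recognized by spaCy but dropped before they reach the analyzer results, which matches the behaviour described in the question. With labels_to_ignore reduced to 'O', all four entities pass through.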