python nlp spacy presidio

Why does Presidio with the spaCy NLP engine not recognize organizations and PESEL numbers while spaCy does?


I'm using spaCy with the pl_core_news_lg model to extract named entities from Polish text. It correctly detects both organizations (ORG) and people's names (PER):

import spacy

nlp = spacy.load("pl_core_news_lg")
text = "Jan Kowalski pracuje w IBM i współpracuje z Microsoft oraz Google."

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(entities)

Output:

[('Jan Kowalski', 'persName'), ('IBM', 'orgName'), ('Microsoft', 'orgName'), ('Google', 'orgName')]

However, when I use Presidio with the pl_core_news_lg model and a configuration file, the recognizers do not correctly detect organizations (ORG) or PESEL numbers, even though they appear in the list of supported entities.

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

provider = NlpEngineProvider(conf_file="path_to_my_file/nlp_config.yaml") 
nlp_engine = provider.create_engine()

print(f"Supported recognizers (from NLP engine): {nlp_engine.get_supported_entities()}")

supported_languages = list(nlp_engine.get_supported_languages())
registry = RecognizerRegistry(supported_languages=["pl"])
registry.load_predefined_recognizers(["pl"])

print(f"Supported recognizers (from registry): {registry.get_supported_entities(['pl'])}")

analyzer = AnalyzerEngine(
    registry=registry, supported_languages=supported_languages, nlp_engine=nlp_engine
)

# `text` is the same sentence used in the spaCy example above
results = analyzer.analyze(text=text, language="pl")

for entity in results:
    print(f"Found entity: {entity.entity_type} with score {entity.score}")

Output:

Supported recognizers (from NLP engine): ['ID', 'NRP', 'DATE_TIME', 'PERSON', 'LOCATION']
Supported recognizers (from registry): ['IN_VOTER', 'URL', 'IBAN_CODE', 'CREDIT_CARD', 'DATE_TIME', 'NRP', 'PHONE_NUMBER', 'MEDICAL_LICENSE', 'PERSON', 'IP_ADDRESS', 'ORGANIZATION', 'CRYPTO', 'LOCATION', 'PL_PESEL', 'EMAIL_ADDRESS']

Even though 'ORGANIZATION' and 'PL_PESEL' are listed as supported by the registry, Presidio does not detect them in the text. I would also expect 'ORGANIZATION' to appear in the list reported by the NLP engine, but it does not.

My config file:

nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg

ner_model_configuration:
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
#    orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID

Why does Presidio fail to detect organizations (ORG) and PESEL numbers (PL_PESEL), while spaCy correctly detects them?


Solution

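  • The configuration file is missing the 'labels_to_ignore' field. Without it, Presidio falls back to a default ignore list, which appears to include ORGANIZATION (this would explain why it is absent from the NLP engine's supported entities). Setting the field to only O states that no real entity labels should be ignored in the NLP engine: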

      labels_to_ignore:
        - O
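
To see why this one field matters, here is a minimal sketch of the filtering step in plain Python. It is a simplified model, not Presidio's actual code: the function name `surviving_entities` and the `ASSUMED_DEFAULT_IGNORE` set are hypothetical, and the set is only an assumed subset of the defaults applied when `labels_to_ignore` is absent.

```python
# Simplified model of label mapping and filtering in the NLP engine.
# NOT Presidio's real implementation; all names here are hypothetical.

# Assumption: a partial stand-in for the default ignore list that is
# used when the config file has no labels_to_ignore field.
ASSUMED_DEFAULT_IGNORE = {"O", "ORG", "ORGANIZATION", "CARDINAL", "EVENT"}

# Subset of the mapping from the question's nlp_config.yaml.
MAPPING = {
    "persName": "PERSON",
    "orgName": "ORGANIZATION",
    "placeName": "LOCATION",
    "geogName": "LOCATION",
}


def surviving_entities(model_ents, mapping, labels_to_ignore=None):
    """Map model labels to Presidio names and drop the ignored ones."""
    if labels_to_ignore is None:
        ignore = ASSUMED_DEFAULT_IGNORE
    else:
        ignore = set(labels_to_ignore)
    kept = []
    for text, label in model_ents:
        mapped = mapping.get(label, label)
        # Drop the entity if either the raw or the mapped label is ignored.
        if label in ignore or mapped in ignore:
            continue
        kept.append((text, mapped))
    return kept


# Entities spaCy returned for the sentence in the question.
spacy_ents = [
    ("Jan Kowalski", "persName"),
    ("IBM", "orgName"),
    ("Microsoft", "orgName"),
    ("Google", "orgName"),
]

without_field = surviving_entities(spacy_ents, MAPPING)
with_field = surviving_entities(spacy_ents, MAPPING, labels_to_ignore=["O"])
print(without_field)
print(with_field)
```

Under these assumptions, the first call keeps only the PERSON entity, while the second also keeps the three ORGANIZATION entities. This is consistent with the question's output, where ORGANIZATION is missing from the entities reported by the NLP engine.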
    

    Added to your configuration, the field would look like this:

    nlp_engine_name: spacy
    models:
      - lang_code: pl
        model_name: pl_core_news_lg
    
    ner_model_configuration:
      labels_to_ignore:
        - O
      model_to_presidio_entity_mapping:
        persName: PERSON
        orgName: ORGANIZATION
    #    orgName: ORG
        placeName: LOCATION
        geogName: LOCATION
        LOC: LOCATION
        GPE: LOCATION
        FAC: LOCATION
        DATE: DATE_TIME
        TIME: DATE_TIME
        NORP: NRP
        ID: ID
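
Regarding PESEL: the sample sentence in the question contains no digits at all, so a PL_PESEL recognizer has nothing to match there. As far as I can tell, that recognizer is pattern-based rather than NER-based, so `labels_to_ignore` should not affect it, and it may also reject numbers with a wrong check digit. Below is a minimal sketch of the public PESEL check-digit rule for preparing a test string. The helper name `pesel_checksum_ok` is hypothetical and is not part of Presidio.

```python
# Sketch of the public PESEL check-digit rule.
# The helper name is hypothetical; this is not Presidio code.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_checksum_ok(pesel: str) -> bool:
    """Return True if `pesel` is 11 ASCII digits with a valid check digit."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        return False
    digits = [int(ch) for ch in pesel]
    # Weighted sum over the first ten digits.
    weighted = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits[:10]))
    # The eleventh digit must complete the sum to a multiple of 10.
    return (10 - weighted % 10) % 10 == digits[10]


# The sentence from the question: it contains no digits.
sample_text = "Jan Kowalski pracuje w IBM i współpracuje z Microsoft oraz Google."
has_digits = any(ch.isdigit() for ch in sample_text)

print(has_digits)
print(pesel_checksum_ok("44051401359"))
print(pesel_checksum_ok("44051401358"))
```

To exercise PL_PESEL, append a checksum-valid number such as the example value above to the text passed to `analyzer.analyze`.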