python nlp spacy langchain presidio

Presidio with Langchain Experimental does not detect Polish names


I am using presidio/langchain_experimental to anonymize text in Polish, but it does not detect names (e.g., "Jan Kowalski"). Here is my code:

from langchain_experimental.data_anonymizer import (
    PresidioAnonymizer,
    PresidioReversibleAnonymizer,
)

config = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "pl", "model_name": "pl_core_news_lg"}],
}

anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
                                languages_config=config)

anonymizer_tool = PresidioReversibleAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
                                               languages_config=config)

text = "Jan Kowalski mieszka w Warszawie i ma e-mail jan.kowalski@example.com."

anonymized_result = anonymizer_tool.anonymize(text)
anon_result = anonymizer.anonymize(text)
deanonymized_result = anonymizer_tool.deanonymize(anonymized_result)

print("Anonymized text:", anonymized_result)
print("Deanonymized text:", deanonymized_result)
print("Map:", anonymizer_tool.deanonymizer_mapping)
print("Anonymized text:", anon_result)

Output:

Anonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail jan.kowalski@example.com.
Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail jan.kowalski@example.com.
Map: {}
Anonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail jan.kowalski@example.com.

I expected the name "Jan Kowalski" and the email address to be anonymized, but the output remains unchanged. I have installed the pl_core_news_lg model using:

python -m spacy download pl_core_news_lg

Am I missing something in the configuration, or does Presidio not support Polish entity recognition properly? Any suggestions on how to make it detect names in Polish?

Interestingly, when I create the anonymizer with no arguments:

anonymizer_tool = PresidioReversibleAnonymizer()

the output looks like this:

Anonymized text: Elizabeth Tate mieszka w Warszawie i ma e-mail christinemurray@example.net. 
Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail jan.kowalski@example.com. 
Map: {'PERSON': {'Elizabeth Tate': 'Jan Kowalski'}, 'EMAIL_ADDRESS': {'christinemurray@example.net': 'jan.kowalski@example.com'}}

If I use spaCy on its own:

nlp = spacy.load("pl_core_news_lg")
doc = nlp(text)

the entities are recognized correctly, so I suspect the problem is in how Presidio is set up rather than in the model. Entities and labels returned by spaCy:

Jan Kowalski persName
Warszawie placeName

So I would rather not write a custom analyzer for this. I would like Presidio to use spaCy, since spaCy itself works as expected.
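To make the label mismatch easier to see, here is a minimal sketch that counts the NER labels a spaCy model produces for a given text. The helper names `entity_label_counts` and `labels_for` are hypothetical and not part of spaCy or Presidio; `spacy` is imported lazily, so the counting helper can also be used on any object that exposes `.ents` with `.label_` attributes.

```python
from collections import Counter


def entity_label_counts(doc):
    """Count NER labels on a spaCy Doc (or any object with .ents / .label_)."""
    return Counter(ent.label_ for ent in doc.ents)


def labels_for(model_name, text):
    """Load a spaCy model by name and count the entity labels found in text."""
    # Imported here so entity_label_counts stays usable without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)
    return entity_label_counts(nlp(text))


# Example call (requires the model to be installed):
# labels_for("pl_core_news_lg", "Jan Kowalski mieszka w Warszawie.")
```

For the text in this question the labels should be the ones shown above (`persName`, `placeName`), which are not the `PERSON` name passed in `analyzed_fields`.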


Solution

  • After some testing, I found a solution:

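    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
    from presidio_analyzer.predefined_recognizers import SpacyRecognizer

    config = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "pl", "model_name": "pl_core_news_lg"}],
    }

    # Create the anonymizer first so the recognizer can be registered on it.
    anonymizer_tool = PresidioReversibleAnonymizer(
        analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"],
        languages_config=config,
    )

    # pl_core_news_lg labels people as "persName", not "PERSON".
    spacy_recognizer = SpacyRecognizer(
        supported_language="pl",
        supported_entities=["persName"],
    )
    anonymizer_tool.add_recognizer(spacy_recognizer)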
    

    The output looks like this:
    Anonymized text: <persName> mieszka w Warszawie i ma e-mail glenn58@example.org.

    Deanonymized text: Jan Kowalski mieszka w Warszawie i ma e-mail jan.kowalski@example.com.

    Map: {'persName': {'<persName>': 'Jan Kowalski', '<persName_2>': 'Jana Kowalskiego'}, 'EMAIL_ADDRESS': {'glenn58@example.org': 'jan.kowalski@example.com'}}

    You need to add a SpacyRecognizer explicitly, with supported_entities set to the labels the spaCy model itself emits (here persName rather than PERSON). I think the documentation is missing or unclear on this point, which is what caused the confusion.
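A possible alternative, which I have not verified against the versions used above: recent presidio-analyzer releases appear to accept a `ner_model_configuration` section in the NLP engine configuration, including a `model_to_presidio_entity_mapping` that translates model labels into Presidio entity names. Assuming `languages_config` is passed through to Presidio unchanged, a sketch of such a config could look like this. The helper `with_label_mapping` and the constant `PL_LABEL_TO_PRESIDIO` are hypothetical names used only for illustration.

```python
import copy

# Labels taken from the spaCy output in the question; the mapping targets
# are Presidio's own entity names. Extend this for other labels as needed.
PL_LABEL_TO_PRESIDIO = {"persName": "PERSON", "placeName": "LOCATION"}


def with_label_mapping(base_config, label_mapping):
    """Return a copy of base_config with a model-to-Presidio label mapping."""
    config = copy.deepcopy(base_config)
    ner = config.setdefault("ner_model_configuration", {})
    # Assumption: this key is honored by the installed presidio-analyzer.
    ner["model_to_presidio_entity_mapping"] = dict(label_mapping)
    return config


base_config = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "pl", "model_name": "pl_core_news_lg"}],
}
config = with_label_mapping(base_config, PL_LABEL_TO_PRESIDIO)

# Assumed usage, not executed here:
# PresidioReversibleAnonymizer(
#     analyzed_fields=["PERSON", "EMAIL_ADDRESS"], languages_config=config
# )
```

If this works in your version, names would be reported as `PERSON`, so the default fake-name replacement should apply instead of the `<persName>` placeholder shown above.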