Tags: python, python-3.x, spacy, named-entity-recognition, spacy-3

Can a spaCy Named Entity Recognition (NER) model, or code around it such as an entity ruler, also catch my additional date patterns as DATE entities?


Anonymization of entities found by a NER model

I am trying to anonymize files using a NER model for German text that occasionally contains a few English words. If I take the spaCy NER models for German and English, de_core_news_sm and en_core_web_sm, they find town names or persons, and at least the English model finds "Dezember 2022", but neither finds the full date "15. Dezember 2022".

Changing the entity recognition

I cannot change the matches of the model itself. I thought I could use an entity ruler to adjust the NER output, but the NER model seems to be fixed. I do not know how my own entity ruler can outweigh the spaCy NER model, or how to get any entity ruler to work at all, even with the NER component disabled. I moved the entity ruler before the NER component in the spaCy pipeline, but I do not see any new matches in the output.
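The ordering behaviour can be illustrated without any downloaded model. A minimal sketch (assuming only that spaCy itself is installed; it uses a blank German pipeline and two entity rulers in place of a pretrained NER component): an earlier pipeline component keeps its entities by default, because the entity ruler's `overwrite_ents` setting defaults to `False`.

```python
from spacy.lang.de import German

nlp = German()

# First ruler in the pipeline: labels the month as DATE.
first = nlp.add_pipe("entity_ruler", name="first_ruler")
first.add_patterns([{"label": "DATE", "pattern": [{"LOWER": "dezember"}]}])

# Second ruler, added after the first: tries to relabel the same token.
second = nlp.add_pipe("entity_ruler", name="second_ruler")
second.add_patterns([{"label": "MONTH", "pattern": [{"LOWER": "dezember"}]}])

doc = nlp("Das Treffen ist im Dezember.")
# The earlier component's DATE label survives; MONTH is discarded.
print([(ent.text, ent.label_) for ent in doc.ents])
```

The same mechanism is why `nlp.add_pipe("entity_ruler", before="ner")` on a pretrained pipeline places the ruler first: the NER component then respects the entity spans the ruler has already set.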

Easy example, mainly from the main spaCy guide at Using the entity ruler:

from spacy.lang.de import German

nlp = German()
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {"label": "DATE", "pattern": [               
        {"lower": {"regex": r"(?:0?[1-9]|[12][0-9]|3[01])[\.\s]{1,2}?(jan(?:uar)?|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?)\.?\s?['`]?\d{0,4}"}},
        {"shape": {"regex": r"(?:0?[1-9]|[12][0-9]|3[01])[\.\s]{1,2}?(0?[1-9]|1[0-2])\.?\s?['`]?\d{0,4}"}},
        {"lower": {"regex": r"(?:jan(?:uar)?|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?)\.?\s?['`]?\d{2,4}"}},
        {"lower": {"regex": r"(?:januar|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?\.?)"}},
        {"shape": "dd"},
        {"TEXT": {"in": ["15"]}}
    ]},
    {"label": "ORG1", "pattern": {"LOWER": "apple"}},
    {"label": "GPE1", "pattern": {"LOWER": "san"}},
    {"label": "DATE1", "pattern": {"TEXT": [{"regex": "^(?:0?[1-9]|[12][0-9]|3[01])$"}]}}
]
ruler.add_patterns(patterns)

# Taking the German Dezember here for the test of the German RegEx
doc = nlp("Apple eröffnet ein Büro in San Francisco am 15. Dezember 2022.")
print([(ent.text, ent.label_) for ent in doc.ents])

Output:

[]

Question

Can I put code around a spaCy Named Entity Recognition (NER) model so that additional date patterns are also caught as DATE entities, and so that these matches outweigh the choices of the NER model?

The aim is that the full "15. Dezember 2022" is found as one DATE entity.


PS

Duplicate?

I found spacy how to add patterns to existing Entity ruler?, which says to retrain rather than add patterns, because the questioner there has trained the NER model themselves:

I have an existing trained custom NER model with NER and Entity Ruler pipes. I want to update and retrain this existing pipeline.

The question "how to add patterns to existing Entity ruler?" asks more or less the same as I do here. But since that NER model is a custom one, the answers say to retrain the NER model with those patterns. That is why this question is hopefully not a duplicate: I cannot retrain the NER model, since it is a ready-made download from spaCy.

Catastrophic forgetting?

Mind that the answers there tell you never to add an entity ruler to the NER model at all if you can retrain your NER model, since it may lead to "catastrophic forgetting" in the already trained NER model; read there for more. If that were right, I would wonder what I am doing here at all, since it would mean that I cannot merge the entity recognition the spaCy NER model was trained on with another entity ruler. I highly doubt that this is true.

Why should I not be able to check a text for some entity patterns first, then run the spaCy NER model on top of that, and let the first-found entities outweigh the second? Why should that lead to catastrophic forgetting if we are talking about two models? Catastrophic forgetting means that the NER model gets retrained on only the new text that I use for the entity ruler, and my new input text would be just one sentence with a date. It would then be easy to find out whether catastrophic forgetting happens at all: just run the pipeline on a sentence with entities other than dates and see what happens.

Perhaps my thinking in two models is wrong, and there are not two models but one entity recognition step that merges the entity ruler and the NER model; that is also how I understood the entity ruler in the first place. Even then, I can still easily test for catastrophic forgetting: if entity recognition gets much worse on a big file, I know that it has happened. If you ask me, this sounds too strange to be true; I doubt that the answers to the other question are right.


Solution

  • Main things

    Each match is one label

    Each pattern entry carries exactly one label; you cannot stuff all of the regex token patterns under a single label as one long token sequence. See code that underlines this at Add multiple EntityRuler with spaCy (ValueError: 'entity_ruler' already exists in pipeline).

    Pattern format

    You have to use ORTH, TEXT or LOWER (not "SHAPE", as I tried above) and then, in a nested dict, the REGEX operator. See the full list of supported attributes at spaCy - Matcher - Patterns.
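    A minimal sketch of the correct format (the label name DAY is made up): the token attribute wraps a nested dict carrying the REGEX operator.

```python
from spacy.lang.de import German

nlp = German()
ruler = nlp.add_pipe("entity_ruler")

# TEXT is the token attribute; REGEX is the nested operator on it.
ruler.add_patterns([
    {"label": "DAY", "pattern": [{"TEXT": {"REGEX": r"^(?:0?[1-9]|[12][0-9]|3[01])$"}}]},
])

doc = nlp("Treffen am 15 im Büro.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('15', 'DAY')]
```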

    No embedded spaces in RegEx

    You also cannot regex-match the already tokenized data against words that contain spaces, since the tokenizer has already split the text into tokens at exactly those spaces; no spaces are left inside the tokens. The only way to match such spans is to break any regex with an embedded \s+ into separate token patterns; see this question, which has no answer yet:

    Square brackets

    In the spaCy guide on the entity ruler, the code example (explosion/spaCy/master/spacy/pipeline/entityruler.py) that puts two token matches in a row instead of one regex with embedded spaces is not bad coding; it is needed exactly like that:

    {'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}

    Astonishingly, you need such square brackets not just for two or more tokens in a row, but even for a single token if you use a token attribute like "LOWER". You would think the square brackets merely start a list, which they do, but the list format is apparently required even for a single token match. I checked this with {'label': 'GPE', 'pattern': [{'LOWER': 'apple'}]}, which worked, while without the square brackets the code did not find the word "Apple" as an entity, only the literal "apple".

    Code example

    A good guide that wraps it up is at:

    Fixed code

    With these hints, I could find an answer to the question above.

    from spacy.lang.de import German
    
    nlp = German()
    ruler = nlp.add_pipe("entity_ruler")
    patterns = [
        {"label": "ORG1", "pattern": {"LOWER": "apple"}},
        {"label": "ORG2", "pattern": [{"LOWER": "apple"}]},
        {"label": "GPE1", "pattern": {"LOWER": "san"}},
        {"label": "GPE2", "pattern": [{"LOWER": "san"}]},
        {"label": "GPE4", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
        {"label": "DATE1", "pattern": {"TEXT": [{"regex": "^(?:0?[1-9]|[12][0-9]|3[01])$"}]}},
        {"label": "DATE2", "pattern": [{"TEXT": {"regex": "^(?:0?[1-9]|[12][0-9]|3[01])$"}}]},
        {"label": "DATE3", "pattern": [{"TEXT": {"regex": "(?:0?[1-9]|[12][0-9]|3[01])"}}, {"LOWER": {"regex": "(jan(?:uar)?|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?)"}}, {"TEXT": {"regex": r"['`]?\d{2,4}"}}]},
        {"label": "DATE4", "pattern": [{"TEXT": {"regex": "(?:0?[1-9]|[12][0-9]|3[01])"}}, {"LOWER": {"regex": "(jan(?:uar)?|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?)"}}]},
        {"label": "DATE5", "pattern": [{"LOWER": {"regex": "(?:jan(?:uar)?|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?)"}}, {"TEXT": {"regex": r"['`]?\d{2,4}"}}]},
        {"label": "DATE6", "pattern": [{"LOWER": {"regex": "^(?:januar|feb(?:ruar)?|mär(?:z)?|apr(?:il)?|mai|jun(?:i)?|jul(?:i)?|aug(?:ust)?|sep(?:t(?:ember)?)?|okt(?:ober)?|nov(?:ember)?|dez(?:ember)?)$"}}]}
    ]
    ruler.add_patterns(patterns)
    
    # Taking the German Dezember here for the test of the German RegEx
    doc = nlp("Apple is opening its first big office in San Francisco on 15. Dezember 2022.")
    print([(ent.text, ent.label_) for ent in doc.ents])
    

    Out:

    [('Apple', 'ORG2'), ('San Francisco', 'GPE4'), ('15. Dezember 2022', 'DATE3')]
    

    Mind that the same code, but with the blank English pipeline, will not find the full "15. Dezember 2022" as one entity:

    from spacy.lang.en import English
    nlp = English()
    

    Only if you run this with the German tokenizer does it also find "15. Dezember 2022" including the dot. The German tokenizer keeps "15." together as a single token, since in German a digit followed by a period is a common ordinal, while the English tokenizer splits the period off as a separate token, so "15", "Dezember" and "2022" can no longer be matched as adjacent tokens.
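    This can be checked directly on the two tokenizers (a minimal sketch, assuming only spaCy itself, no downloaded models):

```python
from spacy.lang.de import German
from spacy.lang.en import English

# Blank pipelines differ mainly in their tokenizer rules.
de_tokens = [t.text for t in German()("am 15. Dezember 2022")]
en_tokens = [t.text for t in English()("on 15. Dezember 2022")]
print(de_tokens)  # German keeps "15." as one token
print(en_tokens)  # English splits "15" and "." apart
```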

    The code above also shows that you do not need to sort the patterns by their number of tokens, like "15. Dezember 2022", "15. Dezember", "Dezember 2022", "Dezember": the entity ruler picks the match with the most tokens by default. Otherwise it would first catch the bare number via label "DATE2", and the full date could no longer be found.