Add a custom component to a pipeline in spaCy 3


I trained an NER model with spaCy 3. I would like to add a custom component (add_regex_match) to the pipeline, after the NER component, to improve the existing NER results.

This is the code I want to implement:

import spacy
from spacy.language import Language
from spacy.tokens import Span
import re

nlp = spacy.load(r"\src\Spacy3\ner_spacy3_hortisem\training\ml_rule_model")

@Language.component("add_regex_match")
def add_regex_entities(doc):   
    new_ents = []

    label_z = "Zeit"
    regex_expression_z = r"^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Januar|März|Mai|Juli|August|Oktober|Dezember)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Januar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Februar))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Januar|Februar|März|April|Mai|Juni|Juli|August|September))|(?:1[0-2]|(?:Oktober|November|Dezember)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
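    # Note: the pattern above is anchored with ^ and $, so it only matches when the whole text is one date.
    for match in re.finditer(regex_expression_z, doc.text):  # find matches in the text
        start, end = match.span()  # character offsets, not token indices
        # char_span maps character offsets to a Span; it returns None if they do not align with token boundaries
        entity = doc.char_span(start, end, label=label_z)
        if entity is not None:
            new_ents.append(entity)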
        
    label_b = "BBCH_Stadium"
    regex_expression_b = r"BBCH(\s?\d+)\s?(\/|\-|(bis)?)\s?(\d+)?"
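    for match in re.finditer(regex_expression_b, doc.text):  # find matches in the text
        start, end = match.span()  # character offsets, not token indices
        # char_span maps character offsets to a Span; it returns None if they do not align with token boundaries
        entity = doc.char_span(start, end, label=label_b)
        if entity is not None:
            new_ents.append(entity)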

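    # Keep the entities found by the NER component and add the regex matches;
    # filter_spans drops overlapping spans, preferring the longest one.
    doc.ents = spacy.util.filter_spans(list(doc.ents) + new_ents)
    return doc

nlp.add_pipe("add_regex_match", after="ner")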

nlp.to_disk("./training/ml_rule_regex_model")

doc = nlp("20/03/2021 8 März 2021 BBCH 15, Fliegen, Flugbrand . Brandenburg, in Berlin, Schnecken, BBCH 13-48, BBCH 3 bis 34")

print([(ent.text, ent.label_) for ent in doc.ents])

When I try to evaluate the saved model ml_rule_regex_model from the command line with python -m spacy project run evaluate, I get the following error:

ValueError: [E002] Can't find factory for 'add_regex_match' for language German (de). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

How can I fix this? Has anyone run into this before? Thank you very much for any tips.


Solution

  • when I want to evaluate the saved model ml_rule_regex_model using the command line python -m spacy project run evaluate, I got the error...

    You haven't included the project.yml of your spaCy project, where the evaluate command is defined, so I will assume it calls spacy evaluate. If so, that command has a --code (or -c) flag that takes a path to a Python file with additional code, such as registered functions. If you pass a file that contains the definition of your add_regex_match component, spaCy can resolve the component when it reads the saved pipeline's config and load the model.
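As a rough sketch, such a file could look like the following. The file name custom_components.py is my own assumption, and to keep it short it only carries over the BBCH pattern from the question. It also assumes that the regex matches should be added to the entities already on the doc rather than replace them, and that matches which do not line up with token boundaries can be skipped.

```python
# custom_components.py (hypothetical file name, passed to spaCy via --code)
import re

from spacy.language import Language
from spacy.util import filter_spans

# BBCH pattern copied from the question
BBCH_PATTERN = re.compile(r"BBCH(\s?\d+)\s?(\/|\-|(bis)?)\s?(\d+)?")


@Language.component("add_regex_match")
def add_regex_entities(doc):
    """Add regex matches as entities, keeping the entities already set on the doc."""
    new_ents = []
    for match in BBCH_PATTERN.finditer(doc.text):
        start, end = match.span()  # character offsets, not token indices
        span = doc.char_span(start, end, label="BBCH_Stadium")
        if span is not None:  # None when the offsets do not align with token boundaries
            new_ents.append(span)
    # Assumption: merge with existing entities; filter_spans removes overlaps
    doc.ents = filter_spans(list(doc.ents) + new_ents)
    return doc
```

The evaluate call would then be something like python -m spacy evaluate ./training/ml_rule_regex_model <your dev data> --code custom_components.py, where the data path is whatever your project.yml already uses. Importing the file is what registers the component name, so the same file also has to be imported before calling spacy.load on the saved model in your own scripts.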