
SpaCy: Regex pattern does not work in rule-based matcher


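I am trying to define a regular expression to use as a text pattern in the entity ruler component of my spaCy model. The aim is to label text as "COMP" whenever it finds words structured like this: a system reference, then a hyphen or a space, then one of the letters V, B, F, K or S followed by three digits.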

To do so, I use the following method:

def add_component_patterns_re(input_references, model_ruler):
    ruler = model_ruler
    ref_patterns = []
    letters = ['V', 'B', 'F', 'K', 'S']

    print("Adding component patterns")
    for ref in input_references.iloc[:, 0]:
        # print(f"Adding references for system: {ref}")
        for letter in letters:
            pattern_text = fr'{ref}(-| ){letter}[0-9]{{3}}'
            pattern = {"TEXT": {"REGEX": pattern_text}}
            ref_patterns.append({"label":"COMP", "pattern":pattern})
    ruler.add_patterns(ref_patterns)

    return ref_patterns

Printing out the added patterns, the output list looks correct to me, so my guess is that I am doing something wrong when defining the pattern that gets added to the ruler. For information, I've also tried wrapping the pattern in a list, like this:

pattern = [{"TEXT": {"REGEX": fr'{ref}(-| ){letter}[0-9]{{3}}'}}]

But the result is the same: I can't seem to get any match.

Does anyone have any suggestions? Thanks in advance!


Solution

  • In the end I dropped the regex and, inside the loop over ref, added every possible string as its own pattern:

    print(f"Adding references for system: {ref}")
    for letter in letters:
        for nnn in range(1000):
            # Hyphen-separated form, number zero-padded to three digits
            pattern = f"{ref}-{letter}{nnn:03d}"
            ref_patterns.append({"label": "COMP", "pattern": pattern})
            # Space-separated form
            pattern = f"{ref} {letter}{nnn:03d}"
            ref_patterns.append({"label": "COMP", "pattern": pattern})

    The code is lengthier and a tad slower, but it does the job just fine!
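
For anyone who wants to try the enumeration on its own, here is a minimal sketch of the same idea as a standalone function. It assumes the references are available as a plain list of strings (the question reads them from the first column of a DataFrame). The helper name `build_comp_patterns` and the reference "ABC" are made up for illustration.

```python
from typing import Dict, Iterable, List


def build_comp_patterns(
    refs: Iterable[str],
    letters: Iterable[str] = ("V", "B", "F", "K", "S"),
) -> List[Dict[str, str]]:
    """Enumerate every '<ref>-<letter>NNN' and '<ref> <letter>NNN' string pattern.

    build_comp_patterns is a hypothetical helper, not part of spaCy.
    """
    letters = tuple(letters)  # so a one-shot iterable can be reused for each ref
    patterns = []
    for ref in refs:
        for letter in letters:
            for nnn in range(1000):
                # One entry per separator: hyphen first, then space
                for sep in ("-", " "):
                    patterns.append(
                        {"label": "COMP", "pattern": f"{ref}{sep}{letter}{nnn:03d}"}
                    )
    return patterns


# "ABC" is a made-up reference used only for this example
patterns = build_comp_patterns(["ABC"])
print(len(patterns))  # 5 letters * 1000 numbers * 2 separators per reference
print(patterns[0], patterns[1])
```

The resulting list can be handed to `ruler.add_patterns(...)` exactly as in the question. A likely reason this works where the regex did not: string patterns are treated as phrase patterns, which the ruler tokenizes itself, while a `REGEX` under `TEXT` is tested against one token at a time. A regex that includes the space separator therefore cannot match, because the reference and the letter-plus-digits part end up in separate tokens.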