python · spacy · tokenize

How to adjust the spaCy tokenizer so that it splits a number followed by a dot at the end of a sentence in the German model


I have a use case in spaCy where I want to find phone numbers in German sentences. Unfortunately, the tokenizer does not tokenize as expected: when the number is at the end of a sentence, the number and the dot are not split into two tokens. The English and German versions differ here; see the following code:

import spacy

nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

text = "Die Nummer lautet 1234 123444."

doc_en = nlp_en(text)
doc_de = nlp_de(text)

print(doc_en[-1]) #output is: .
print(doc_de[-1]) #output is: 123444.

The expected output is that "123444." is split into two tokens. But I also want to keep using the "de" version, as it has other meaningful defaults for German sentences...

My spaCy version: 3.7.4

In a similar case I was able to solve the problem with nlp_de.tokenizer.add_special_case, but here I need to match a number that I don't know in advance, and I couldn't find a way to use a regex with add_special_case (see the sketch below).
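
For reference, this is roughly what a special case looks like. It only matches an exact string, so it cannot cover an arbitrary number (a minimal sketch, assuming spaCy 3.x; the "5." example is made up):

import spacy
from spacy.symbols import ORTH

nlp_de = spacy.blank("de")

# A special case applies to this exact string only, not to a pattern.
nlp_de.tokenizer.add_special_case("5.", [{ORTH: "5"}, {ORTH: "."}])

print([t.text for t in nlp_de("Die Nummer lautet 5.")])
# expected: ['Die', 'Nummer', 'lautet', '5', '.']
print([t.text for t in nlp_de("Die Nummer lautet 123444.")])
# expected: ['Die', 'Nummer', 'lautet', '123444.']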

I also had a look at Is it possible to change the token split rules for a Spacy tokenizer?, which seems promising, but I wasn't able to figure out how to adjust the tokenizer. I guess I should use a custom tokenizer and the information from https://github.com/explosion/spaCy/blob/master/spacy/lang/de/punctuation.py !?
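
One way to see where the two language defaults differ is to diff their suffix rules (a quick inspection sketch using the same Defaults.suffixes attribute the solution below relies on):

import spacy

nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# Suffix rules that only exist in one of the two defaults; the rule that
# splits "." after digits should show up in this difference.
print(set(nlp_en.Defaults.suffixes) - set(nlp_de.Defaults.suffixes))
print(set(nlp_de.Defaults.suffixes) - set(nlp_en.Defaults.suffixes))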


Solution

  • You can modify the tokenizer's suffixes to fix issues like this with punctuation. Here is an example:

    import spacy

    nlp_en = spacy.blank("en")
    nlp_de = spacy.blank("de")

    text = "Die Nummer lautet 1234 123444."

    # Add the plain dot as an extra suffix rule, rebuild the suffix regex
    # and plug it into the German tokenizer.
    suffixes = nlp_de.Defaults.suffixes + [r"\."]
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    nlp_de.tokenizer.suffix_search = suffix_regex.search

    doc_en = nlp_en(text)
    doc_de = nlp_de(text)

    print(doc_en[-1])  # output is: .
    print(doc_de[-1])  # output is: .
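
    If splitting every trailing dot is too aggressive (German ordinals such as "3." would then also be split), the added suffix can be narrowed, for example to dots that follow several digits. A sketch, assuming the numbers of interest are at least four digits long:

    import spacy

    nlp_de = spacy.blank("de")

    # Only treat the dot as a suffix when it follows at least four digits,
    # so ordinals like "3." stay intact. The 4-digit threshold is an assumption.
    suffixes = nlp_de.Defaults.suffixes + [r"(?<=\d{4})\."]
    nlp_de.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

    doc = nlp_de("Das Treffen ist am 3. Mai, die Nummer lautet 123444.")
    print([t.text for t in doc])
    # expected: ['Das', 'Treffen', 'ist', 'am', '3.', 'Mai', ',',
    #            'die', 'Nummer', 'lautet', '123444', '.']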