I have a use case in spaCy where I want to find phone numbers in German sentences. Unfortunately, the tokenizer is not tokenizing as expected: when the number is at the end of a sentence, the number and the sentence-final dot are not split into two tokens. The English and German versions differ here, see the following code:
import spacy
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")
text = "Die Nummer lautet 1234 123444."
doc_en = nlp_en(text)
doc_de = nlp_de(text)
print(doc_en[-1]) #output is: .
print(doc_de[-1]) #output is: 123444.
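To make the difference more visible, here are the full token lists (continuing the snippet above):
print([t.text for t in doc_en])  # ['Die', 'Nummer', 'lautet', '1234', '123444', '.']
print([t.text for t in doc_de])  # ['Die', 'Nummer', 'lautet', '1234', '123444.']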
The expected output is that "123444." is split into two tokens ("123444" and "."). But I also want to keep using the "de" version, as it has other meaningful defaults for German sentences.
My spaCy version: 3.7.4
In a similar case I was able to solve the problem with nlp_de.tokenizer.add_special_case, but here I need to match a number that I don't know in advance, and I couldn't find a way to use a regex with add_special_case.
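As far as I understand, add_special_case only covers one exact string and its split, e.g. something like this (continuing the snippet above, the number is just an example):
from spacy.symbols import ORTH
# only applies to this one literal string, not to arbitrary numbers
nlp_de.tokenizer.add_special_case("123444.", [{ORTH: "123444"}, {ORTH: "."}])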
I also had a look at "Is it possible to change the token split rules for a Spacy tokenizer?", which seems promising, but I wasn't able to figure out how to adjust the tokenizer. I guess I should use a custom tokenizer together with the information from https://github.com/explosion/spaCy/blob/master/spacy/lang/de/punctuation.py?
You can use the tokenizer's suffixes to fix issues with trailing punctuation: add a rule that splits off the dot and rebuild the suffix regex. Here is an example:
import spacy
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")
text = "Die Nummer lautet 1234 123448."
suffixes = nlp_de.Defaults.suffixes + [r'\.']  # add a rule that always splits off a trailing dot
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp_de.tokenizer.suffix_search = suffix_regex.search  # install the rebuilt suffix rules on the existing tokenizer
doc_en = nlp_en(text)
doc_de = nlp_de(text)
print(doc_en[-1]) #output is: .
print(doc_de[-1]) #output is: .
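One caveat, based on how the German defaults are built: the German suffix rules intentionally do not split a dot that follows a digit, because German ordinals and dates are written as "3.", "3. Mai", etc. A bare r'\.' suffix therefore also splits those ordinals. If that matters, a narrower pattern that only fires after longer digit runs is a possible compromise; the minimum length of four digits below is just an assumption to adjust to your phone number format:
import spacy

nlp_de = spacy.blank("de")
# split a trailing dot only after a run of at least four digits,
# so short ordinals such as "3." stay one token
suffixes = list(nlp_de.Defaults.suffixes) + [r'(?<=\d{4})\.']
nlp_de.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
doc = nlp_de("Die Nummer lautet 1234 123444. Am 3. Mai rufe ich an.")
print([t.text for t in doc])  # "123444." is split, the ordinal "3." stays one token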