Following the examples from the documentation on tokenization, I have the following code:
import spacy
from spacy.symbols import ORTH, NORM
nlp = spacy.load("en_core_web_sm")
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
doc = nlp("gimme that. he gave me that. Going to someplace.")
Then I check the tokenization:
doc[0].norm_ # 'give' (as expected)
But the lemmatizer does not return the same output:
lemmatizer = nlp.get_pipe("lemmatizer")
lemmatizer.lemmatize(doc[0]) # ['gim'] (expected ['give'])
On the other hand:
lemmatizer.lemmatize(doc[5]) # ['give']
lemmatizer.lemmatize(doc[9]) # ['go']
What am I doing wrong, and how can I fix it? In spaCy, what is the difference between normalized tokens and lemmatized tokens? And how can I "teach" the lemmatization of a single token (like the "gim" token in this example)?
In your code you've customized the tokenizer to handle the special case "gimme" and normalize its first piece to "give". That only sets the token's NORM attribute, which is a normalized form of the surface text used mainly as a feature by the statistical models. The lemma is a separate attribute: the dictionary base form assigned by the lemmatizer component. The rule-based English lemmatizer works from the token's text and POS tag, not from its NORM, which is why lemmatize(doc[0]) still returns ['gim'] even though doc[0].norm_ is 'give'.
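You can check that en_core_web_sm uses the rule-based lemmatizer and see the two attributes diverge (mode is a documented attribute of spaCy v3's Lemmatizer, so this is just a quick sanity check):

lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)              # 'rule'
print(doc[0].norm_, doc[0].lemma_)  # give gim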
Here's how you can achieve consistent lemmatization results with your custom normalization: add a small component after the lemmatizer that overrides the lemma wherever your custom norm applies.
import spacy
from spacy.language import Language
from spacy.symbols import ORTH, NORM
nlp = spacy.load("en_core_web_sm")
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
# Define a custom pipeline component that overrides lemmas
@Language.component("custom_lemmatizer")
def custom_lemmatizer_function(doc):
    for token in doc:
        if token.norm_ == "give":
            token.lemma_ = "give"
        # Add more custom rules for other words if needed
    return doc

# Add the custom component after the lemmatizer so it can
# overwrite the lemmas the lemmatizer has already assigned
nlp.add_pipe("custom_lemmatizer", after="lemmatizer")
doc = nlp("gimme that. he gave me that. Going to someplace.")
print(doc[0].lemma_) # 'give' (as expected)
print(doc[5].lemma_) # 'give' (as expected)
print(doc[9].lemma_) # 'go' (as expected)
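Alternatively, for pinning the lemma of individual tokens you can use spaCy's built-in AttributeRuler instead of writing a component yourself. A minimal sketch, assuming the pretrained pipeline's defaults (the attribute_ruler runs before the lemmatizer, and the lemmatizer's default overwrite=False leaves an already-set lemma alone):

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}])

# en_core_web_sm ships with an attribute_ruler component; add a
# Matcher pattern that assigns the lemma "give" to the token "gim"
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(patterns=[[{"TEXT": "gim"}]], attrs={"LEMMA": "give"})

doc = nlp("gimme that.")
print(doc[0].lemma_)  # 'give'

If I remember correctly, this is also the mechanism spaCy v3 itself uses for token-level exceptions, so it keeps your customization in data rather than in code.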