I'm using a custom normalizer for my custom tokenizer.
The custom normalizer is as follows:
from tokenizers import NormalizedString, Regex

class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # Most of these can be replaced by a `Sequence` combining some provided Normalizers,
        # (i.e. Sequence([NFKC(), Replace(Regex(r"\s+"), " "), Lowercase()]))
        # and that should be the preferred way. That being said, here is an example of the
        # kind of things that can be done here:
        try:
            if normalized is None:
                normalized = NormalizedString("")
            else:
                normalized.nfkc()
                normalized.filter(lambda char: not char.isnumeric())
                normalized.replace(Regex(r"\s+"), " ")
                normalized.lowercase()
        except TypeError as te:
            print("CustomNormalizer TypeError:", te)
            print(normalized)
The code is adapted from this example: https://github.com/huggingface/tokenizers/blob/b24a2fc1781d5da4e6ebcd3ecb5b91edffc0a05f/bindings/python/examples/custom_components.py
When I use this normalizer with a custom Tokenizer (code below) and try to save the trained tokenizer, it raises:
Exception: Custom Normalizer cannot be serialized
The custom tokenizer code is as follows:
from tokenizers import Tokenizer, models, trainers
from tokenizers.normalizers import Normalizer

model = models.WordPiece(unk_token="[UNK]")
tokenizer = Tokenizer(model)
tokenizer.normalizer = Normalizer.custom(CustomNormalizer())

trainer = trainers.WordPieceTrainer(
    vocab_size=2500,
    special_tokens=special_tokens,
    show_progress=True
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer, length=len(dataset))

# Save the Tokenizer result
tokenizer.save('saved.json')  # this line raises the Exception
How can I resolve this exception?
With reference to this comment on GitHub, the solution is to swap the custom Normalizer, Pre-Tokenizer and Decoder for standard ones before saving, then restore the custom ones after loading the tokenizer from file.
The code is below (any default normalizer, pre-tokenizer and decoder will do; the ones selected here are just an example):
Saving
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.decoders import WordPiece
tokenizer.normalizer = NFKC()
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = WordPiece()
tokenizer.save("my_tokenizer.json")
Loading
from tokenizers import Tokenizer
from tokenizers.normalizers import Normalizer
from tokenizers.pre_tokenizers import PreTokenizer
from tokenizers.decoders import Decoder

tokenizer_loaded = Tokenizer.from_file("my_tokenizer.json")
# Re-attach the custom components, since they are not stored in the JSON file
tokenizer_loaded.normalizer = Normalizer.custom(CustomNormalizer())
tokenizer_loaded.pre_tokenizer = PreTokenizer.custom(PyCantonesePreTokenizer())
tokenizer_loaded.decoder = Decoder.custom(CustomDecoder())
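Alternatively, if you don't strictly need a Python class, the comment inside CustomNormalizer already hints at a fully serializable equivalent built from the provided normalizers, so tokenizer.save() works without any swapping. A minimal sketch, assuming Replace(Regex(r"\d"), "") is an acceptable stand-in for the isnumeric() filter (it only covers Unicode digits, not every numeric character):
from tokenizers import Regex, normalizers
from tokenizers.normalizers import NFKC, Replace, Lowercase

# A built-in Sequence is serialized together with the tokenizer
tokenizer.normalizer = normalizers.Sequence([
    NFKC(),
    Replace(Regex(r"\d"), ""),   # rough stand-in for filtering numeric characters
    Replace(Regex(r"\s+"), " "),
    Lowercase(),
])
tokenizer.save("my_tokenizer.json")  # no "cannot be serialized" exception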
Hope it helps someone in the future.