I have a custom Tokenizer built and trained with the HuggingFace Tokenizers library. I can save and load the custom tokenizer to a JSON file without a problem.
Here is the simplified code:
from tokenizers import Tokenizer, models

# Build a WordPiece tokenizer
model = models.WordPiece(unk_token="[UNK]")
tokenizer = Tokenizer(model)
# Train from a dataset held in memory
tokenizer.train_from_iterator(get_training_corpus())
# Save to a file
tokenizer.save('my-tokenizer.json')
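In the full version I also pass a trainer so that the special tokens end up in the vocabulary. Roughly like this, where the vocab_size value is just an example:
from tokenizers import trainers

trainer = trainers.WordPieceTrainer(
    vocab_size=30000,  # example value, not from my real setup
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)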
Here is how I load the custom tokenizer:
tokenizer = Tokenizer.from_file('my-tokenizer.json')
The problem is: can I push my custom tokenizer to the HuggingFace Hub? There is no push_to_hub() method on the Tokenizer class.
I know that if I train from a pre-trained model, I can save the new tokenizer and push it to the HuggingFace Hub using the following code:
from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("a-pretrained-model")
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus())
# save the pre-trained tokenizer to the specified folder with config.json and other files
tokenizer.save_pretrained("my-new-shiny-tokenizer")
# push the pre-trained tokenizer to HuggingFace Hub
tokenizer.push_to_hub("my-new-shiny-tokenizer-in-hf")
But I cannot use this approach, as my tokenizer requires a custom decoder, normalizer and pre-tokenizer.
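For illustration, my custom pieces are attached roughly like this (the built-in components below are just stand-ins for my own implementations):
from tokenizers import normalizers, pre_tokenizers, decoders

# Stand-ins: my real normalizer, pre-tokenizer and decoder are custom classes
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.WordPiece(prefix="##")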
You are almost there!
Your currently implemented tokenizer is based on a class from the tokenizers library. To push it, you must first wrap it in a tokenizer class from the transformers library. For example:
# Wrap your own tokenizer
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",  # You can load from the tokenizer file
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
# Finally, save your own pretrained tokenizer
wrapped_tokenizer.save_pretrained('my-tokenizer')
Code is taken from this tutorial: https://huggingface.co/learn/nlp-course/chapter6/8#building-a-bpe-tokenizer-from-scratch
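As an aside, if you still have the Tokenizer object in memory, PreTrainedTokenizerFast can also wrap it directly via tokenizer_object instead of reading the JSON file back in:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,  # the in-memory tokenizers.Tokenizer instance
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)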
However, since you asked a more general question, here are a few more steps you need to complete to push it. I'm assuming below that you're working in a notebook.
In your HuggingFace account, create a new model repository, for example one named foo.
In your notebook, enter this:
from huggingface_hub import login
login()
and paste an access token with 'write' permission when prompted.
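If you are working in a script rather than a notebook, you can also pass the token directly (the hf_xxx value below is a placeholder for your own token):
from huggingface_hub import login

login(token="hf_xxx")  # placeholder: substitute your own 'write' token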
Finally, in your notebook, you can push your own tokenizer to the foo repo using:
wrapped_tokenizer.push_to_hub(repo_id='foo')
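Once pushed, the tokenizer can be loaded back from the Hub like any other pretrained tokenizer (your-username below is a placeholder for your account name):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/foo")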