python  huggingface-transformers  huggingface-tokenizers  huggingface  huggingface-hub

Using a custom-trained Hugging Face tokenizer


I’ve trained a custom tokenizer on a custom dataset using the code from the documentation. Is there a way to add this tokenizer to the Hub and use it like the other tokenizers, by calling AutoTokenizer.from_pretrained()? And if I can’t do that, how can I use the tokenizer to train a custom model from scratch? Thanks for your help!!!

Here's the code:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

folder = 'dataset_unicode'
files = [f"/content/drive/MyDrive/{folder}/{split}.txt" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)

from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# I've tried saving it like this but it doesn't work as I expect it:
tokenizer.save("data/tokenizer-custom.json")

Solution

  • The AutoTokenizer expects a few files in the directory:

    awesometokenizer/
        tokenizer_config.json
        special_tokens_map.json
        tokenizer.json
    

    But the default tokenizers.Tokenizer.save() method only saves the vocab file as awesometokenizer/tokenizer.json. Open that JSON file and compare its ['model']['vocab'] keys to the JSON you saved at data/tokenizer-custom.json.
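
    For example, a quick way to inspect what Tokenizer.save() actually wrote (a minimal sketch, assuming the data/tokenizer-custom.json path from the question):

    import json
    
    # Load the raw JSON produced by tokenizers' Tokenizer.save()
    with open("data/tokenizer-custom.json") as f:
        raw = json.load(f)
    
    # The trained BPE vocab lives under ['model']['vocab']
    print(list(raw["model"]["vocab"].items())[:5])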

    The simplest way to make the tokenizer loadable with AutoTokenizer.from_pretrained() is to follow the answer that @cronoik posted in the comments and wrap it in a PreTrainedTokenizerFast, i.e. add a few lines to your existing code:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.processors import TemplateProcessing
    
    from transformers import PreTrainedTokenizerFast  # <---- Add this line.
    
    
    
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    
    files = ["big.txt"]  # e.g. training with https://norvig.com/big.txt
    tokenizer.train(files, trainer)
    
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )
    
    # Add these lines:
    #     |
    #     |
    #     V
    awesome_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
    awesome_tokenizer.save_pretrained("awesome_tokenizer")
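
    If you also want the tokenizer on the Hub (the first part of the question), PreTrainedTokenizerFast inherits push_to_hub() from transformers; a minimal sketch, assuming you are logged in via huggingface-cli login and that "your-username/awesome_tokenizer" is a placeholder repo id:

    # Placeholder repo id; requires `huggingface-cli login` beforehand
    awesome_tokenizer.push_to_hub("your-username/awesome_tokenizer")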
    

    Then you can load the trained tokenizer:

    from transformers import AutoTokenizer
    
    auto_loaded_tokenizer = AutoTokenizer.from_pretrained(
        "awesome_tokenizer", 
        local_files_only=True
    )
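
    As a quick sanity check, the reloaded tokenizer should apply both the trained BPE model and the [CLS]/[SEP] post-processor (a minimal sketch; the sample sentence is arbitrary):

    # The post-processor wraps the sequence in [CLS] ... [SEP]
    encoded = auto_loaded_tokenizer("The quick brown fox")
    print(auto_loaded_tokenizer.convert_ids_to_tokens(encoded["input_ids"]))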
    
    

    Note: although tokenizers can be pip-installed, it is a library written in Rust with Python bindings.
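
    As for training a custom model from scratch (the second part of the question), the loaded tokenizer can be paired with any transformers model whose config matches its vocabulary size. A minimal sketch, assuming a BERT-style masked language model (the architecture choice is an illustration, not prescribed by the answer above):

    from transformers import BertConfig, BertForMaskedLM
    
    # len(...) counts the trained vocab plus any added special tokens
    config = BertConfig(vocab_size=len(auto_loaded_tokenizer))
    model = BertForMaskedLM(config)  # randomly initialised, ready to train from scratch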