Tags: nlp, huggingface-transformers, huggingface, huggingface-tokenizers

How can I adjust the performance of the tokenizer?


I'm working with a tokenizer from Hugging Face's transformers library. The tokenizer works fine in most cases, but in some cases it does not.

I'm wondering if I can "adjust" the tokenizer (rather than train a new one from scratch) so that it handles the bad cases while still performing as well as before on the cases it already handles correctly.

To be more specific, the tokenizer is a transformers.XLMRobertaTokenizerFast, which is a SentencePiece unigram tokenizer, and the model is paraphrase-multilingual-mpnet-base-v2.
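
For context, here is a minimal sketch of how the tokenizer can be loaded and inspected (assuming the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 checkpoint on the Hub; the example word is made up):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    print(type(tokenizer).__name__)              # XLMRobertaTokenizerFast
    print(tokenizer.tokenize("This is asadaf"))  # words unknown to the vocabulary get split into several subword pieces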


Solution

  • You can extend the tokenizer's vocabulary with add_tokens and resize the model's input embeddings to match (the loading lines below assume the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 checkpoint, just to make the snippet self-contained):

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    model = AutoModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    tokenizer.add_tokens(["asadaf", "sdfsaf"])     # register the new strings as whole tokens
    model.resize_token_embeddings(len(tokenizer))  # grow the model's input embedding matrix to the new vocabulary size
    input_text = "This is asadaf and sdfsaf"
    print(tokenizer(input_text))
    

    As a result, asadaf and sdfsaf will each be tokenized as a single, dedicated token instead of being split into subword pieces.
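
    A quick way to verify the effect and persist it, assuming the snippet above has been run (the behaviour described in the comments is illustrative, and the save paths are placeholders):

    print(tokenizer.tokenize("This is asadaf and sdfsaf"))
    # the added strings now appear as single tokens instead of being split
    print(tokenizer.convert_tokens_to_ids(["asadaf", "sdfsaf"]))
    # their IDs are appended after the original vocabulary

    tokenizer.save_pretrained("extended-tokenizer")  # placeholder path
    model.save_pretrained("extended-model")          # placeholder path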