I want to add custom tokens to the BertTokenizer. However, the tokenizer does not use the new token when tokenizing.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens('##oldert')
text = "DocumentOlderThan"
tokens = tokenizer.tokenize(text)
print(tokens)
Output is:
['document', '##old', '##ert', '##han']
But I would expect:
['document', '##oldert', '##han']
How can I make the tokenizer use the new token instead of multiple old ones?
You need to update the tokenizer's vocabulary directly. Tokens registered via add_tokens are matched as standalone strings in the raw text before WordPiece tokenization runs, so a ##-prefixed continuation piece like ##oldert can never match; it has to live in the WordPiece vocabulary itself. For the slow (Python) BertTokenizer you can mutate tokenizer.vocab in place, because its internal WordpieceTokenizer holds a reference to the same dict (this does not work for BertTokenizerFast):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

added_tokens = ['##oldert']

# Insert the tokens one at a time so each gets its own id; a single
# dict comprehension would assign every new token the same id, since
# len(tokenizer.vocab) is evaluated before the update is applied.
for token in added_tokens:
    if token not in tokenizer.vocab:
        tokenizer.vocab[token] = len(tokenizer.vocab)

# Keep the reverse mapping (id -> token) in sync for decoding.
tokenizer.ids_to_tokens.update({v: k for k, v in tokenizer.vocab.items()})

text = "DocumentOlderThan"
tokens = tokenizer.tokenize(text)
print(tokens)
Which results in:
['document', '##oldert', '##han']
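
If you also want a BERT model to make use of the new token (as the question hints), the model's embedding matrix has to be resized to cover the enlarged vocabulary; the new row is randomly initialized and only becomes meaningful after fine-tuning. A minimal sketch, continuing from the tokenizer above and assuming a standard bert-base-uncased checkpoint:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Grow the embedding matrix so the new token id has a row;
# without this, looking up the new id raises an index error.
model.resize_token_embeddings(len(tokenizer))

input_ids = tokenizer.encode("DocumentOlderThan", return_tensors="pt")
outputs = model(input_ids)

Saving with tokenizer.save_pretrained should also persist the enlarged vocab.txt, so the change survives a save/reload round trip.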