The docs say:
Vocabulary size: 2**17 ~= 128k
But what is the exact vocabulary size of the Mistral-Nemo-Instruct-2407 tokenizer?
Every standard tokenizer exposes a vocab_size property, and __len__ can also be used to get the size of the supported vocabulary. The vocabulary itself is available as a dictionary via the vocab property.
from transformers import AutoTokenizer

secret = 'hf_...'  # your Hugging Face access token (the model repo is gated)
t = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407", token=secret)

print(t.vocab_size)   # size of the base vocabulary
print(len(t))         # base vocabulary plus any added tokens
print(len(t.vocab))   # number of entries in the vocab dictionary
Output:
131072
131072
131072
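So all three agree here, and the size is exactly 2**17 = 131072. One caveat worth knowing: vocab_size reports only the base vocabulary, so vocab_size and len() can diverge once tokens are added to a tokenizer. A minimal sketch of that behavior, using the ungated gpt2 tokenizer purely for illustration:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)               # 50257 -- base vocabulary only
tok.add_tokens(["<my_new_token>"])  # hypothetical extra token
print(tok.vocab_size)               # still 50257
print(len(tok))                     # 50258 -- base vocabulary plus added tokens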