huggingface-transformers, tokenize, large-language-model, mistral-ai

What is the exact vocab size of the Mistral-Nemo-Instruct-2407 tokenizer model?


From the docs, it's

Vocabulary size: 2**17 ~= 128k
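
For reference, that power of two evaluates exactly:

    print(2 ** 17)  # 131072, i.e. the "~128k" from the docs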

But what is the exact vocab size of the Mistral-Nemo-Instruct-2407 tokenizer model?


Solution

  • Every standard tokenizer exposes a vocab_size property; __len__ can also be used to get the size of the supported vocabulary, and the vocabulary itself is available as a dictionary via vocab.

    from transformers import AutoTokenizer
    
    # Mistral-Nemo is a gated repo, so an access token is required
    secret = 'hf_...'
    t = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407", token=secret)
    
    print(t.vocab_size)   # size of the base vocabulary
    print(len(t))         # base vocabulary plus any added tokens
    print(len(t.vocab))   # size of the vocabulary dictionary
    

    Output:

    131072
    131072
    131072
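
    Here all three numbers agree because this tokenizer defines no added tokens on top of its base vocabulary. In general, vocab_size and len() can diverge: vocab_size reports only the base vocabulary, while len() also counts added tokens. A minimal sketch, assuming the same checkpoint and access token as above and a made-up placeholder token:

    from transformers import AutoTokenizer
    
    secret = 'hf_...'
    t = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407", token=secret)
    
    # register a hypothetical custom token that is not in the base vocabulary
    t.add_tokens(["<my_custom_token>"])
    
    print(t.vocab_size)  # 131072, base vocabulary only, unchanged
    print(len(t))        # 131073, base vocabulary + 1 added token

    This distinction matters when resizing a model's embeddings after adding tokens, where len(tokenizer) is the number to use.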