huggingface-transformers huggingface-tokenizers

Huggingface tokenizer has two ids for the same token


I'm loading a HF tokenizer and want generation to stop on the sequence "</|im_end|>", but it looks like the tokenizer has two different ids for the same token. Is this a bug, or is it supposed to work this way?

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ", device_map='cuda')
model = AutoModelForCausalLM.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ", device_map='cuda')

tokenizer.decode(700) #  '</'
tokenizer.decode(1867) # '</'
tokenizer.decode(700) == tokenizer.decode(1867) # True

Solution

  • The tokenization depends on whether the given token appears at the beginning of a word (preceded by whitespace) or inside a word, attached to the preceding text. Note the difference:

    tokenizer('test </')
    >>> {'input_ids': [1, 1369, 1867], 'attention_mask': [1, 1, 1]}
    
    tokenizer('test</')
    >>> {'input_ids': [1, 1369, 700], 'attention_mask': [1, 1, 1]}
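
    If it helps to see what the two ids encode, decoding the two sequences above (the ids after the leading BOS token 1) should round-trip to the original strings, with and without the space; the expected outputs are an assumption based on the encodings shown:

    tokenizer.decode([1369, 1867])  # expected: 'test </'
    tokenizer.decode([1369, 700])   # expected: 'test</'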
    

    This is actually not unique to this token; even common words have two tokenizations depending on where they appear within a word, e.g. the token power:

    tokenizer('power')
    >>> {'input_ids': [1, 1982], 'attention_mask': [1, 1]}
    
    tokenizer('superpower')
    >>> {'input_ids': [1, 2195, 6468], 'attention_mask': [1, 1, 1]}
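
    Using the ids from the outputs above (1982 for the stand-alone word, 6468 for the word-internal piece), decoding should show the same collapse to identical text, assuming the pattern holds:

    tokenizer.decode(1982)  # expected: 'power'
    tokenizer.decode(6468)  # expected: 'power' (the word-internal piece)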
    

    Some tokenizers include a prefix that signals that the token only appears inside a word (not at its start), e.g.:

    bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
    bert_tokenizer.decode(1540)
    >>> 'power'
    
    bert_tokenizer.decode(9447)
    >>> '##power'
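
    The same distinction is visible when tokenizing rather than decoding; the exact splits depend on the BERT vocabulary, so treat the expected outputs below as an assumption:

    bert_tokenizer.tokenize('power')       # e.g. ['power']
    bert_tokenizer.tokenize('superpower')  # e.g. ['super', '##power']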
    

    This is true even for the tokenizer in question if you investigate the tokenizer.vocab object (here the prefix ▁ marks a token that appears at the beginning of a word); however, I am not sure why this marker does not carry over to the tokenizer.decode output:

    list(tokenizer.vocab.keys())[list(tokenizer.vocab.values()).index(700)]
    >>> '</'
    
    list(tokenizer.vocab.keys())[list(tokenizer.vocab.values()).index(1867)]
    >>> 'ā–</'
    

    As for stopping the generation, I would investigate which token, or sequence of tokens, is usually produced at the end of a sequence, and use that as the stopping criterion (or possibly both variants; I am not familiar with the concrete implementation).
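
    As a minimal sketch of that idea (an assumption about how one might wire it up, not part of the original answer), a custom StoppingCriteria can halt generation once the generated ids end with either tokenization of the stop text. The stop sequences are built with the same trick as above: tokenize the text once after a space and once glued to a dummy word, then drop the BOS id and the dummy-word id:

    from transformers import StoppingCriteria, StoppingCriteriaList

    class StopOnTokenSequence(StoppingCriteria):
        """Stop when the generated ids end with any of the given id sequences."""
        def __init__(self, stop_sequences):
            self.stop_sequences = stop_sequences  # list of lists of token ids

        def __call__(self, input_ids, scores, **kwargs):
            # Assumes batch size 1, as in a simple generate() call.
            generated = input_ids[0].tolist()
            return any(
                len(generated) >= len(seq) and generated[-len(seq):] == seq
                for seq in self.stop_sequences
            )

    stop_text = "</|im_end|>"
    # Both tokenizations of the stop text: after a space and attached to a
    # preceding word ('test' is a single token here, so [2:] drops BOS + 'test').
    stop_sequences = [
        tokenizer("test " + stop_text)["input_ids"][2:],
        tokenizer("test" + stop_text)["input_ids"][2:],
    ]

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        stopping_criteria=StoppingCriteriaList([StopOnTokenSequence(stop_sequences)]),
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))

    Note that this only triggers if the model actually emits one of these exact id sequences; if the model tokenizes the end marker differently during generation, decoding the tail of the output and checking for the stop string is a more robust (if slower) alternative.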