huggingface-transformers huggingface-tokenizers

Huggingface tokenizer has two ids for the same token


I'm loading a HF tokenizer and want generation to stop on the sequence "</|im_end|>", but it looks like the tokenizer has two different ids for the same token. Is this a bug, or is it supposed to work this way?

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ", device_map='cuda')
model = AutoModelForCausalLM.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ", device_map='cuda')

tokenizer.decode(700) #  '</'
tokenizer.decode(1867) # '</'
tokenizer.decode(700) == tokenizer.decode(1867) # True

Solution

  • The tokenization depends on whether the given token appears at the beginning of a word (preceded by whitespace) or inside a word, attached to the preceding text. Note the difference:

    tokenizer('test </')
    >>> {'input_ids': [1, 1369, 1867], 'attention_mask': [1, 1, 1]}
    
    tokenizer('test</')
    >>> {'input_ids': [1, 1369, 700], 'attention_mask': [1, 1, 1]}
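
    If it helps to see what the two ids encode, decoding the two sequences above (the ids after the leading BOS token 1) should round-trip to the original strings, with and without the space; the expected outputs are an assumption based on the encodings shown:

    tokenizer.decode([1369, 1867])  # expected: 'test </'
    tokenizer.decode([1369, 700])   # expected: 'test</'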
    

    This is actually not unique to this token; even common words have two tokenizations depending on where they appear within a word, e.g. the token power:

    tokenizer('power')
    >>> {'input_ids': [1, 1982], 'attention_mask': [1, 1]}
    
    tokenizer('superpower')
    >>> {'input_ids': [1, 2195, 6468], 'attention_mask': [1, 1, 1]}
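
    Using the ids from the outputs above (1982 for the stand-alone word, 6468 for the word-internal piece), decoding should show the same collapse to identical text, assuming the pattern holds:

    tokenizer.decode(1982)  # expected: 'power'
    tokenizer.decode(6468)  # expected: 'power' (the word-internal piece)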
    

    Some tokenizers include a prefix that signals that the token only appears inside a word (not at its start), e.g.:

    bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
    bert_tokenizer.decode(1540)
    >>> 'power'
    
    bert_tokenizer.decode(9447)
    >>> '##power'
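
    The same distinction is visible when tokenizing rather than decoding; the exact splits depend on the BERT vocabulary, so treat the expected outputs below as an assumption:

    bert_tokenizer.tokenize('power')       # e.g. ['power']
    bert_tokenizer.tokenize('superpower')  # e.g. ['super', '##power']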
    

    This is true even for the tokenizer in question if you investigate the tokenizer.vocab object (here the prefix ▁ marks a token that appears at the beginning of a word); however, I am not sure why this marker does not carry over to the tokenizer.decode output:

    list(tokenizer.vocab.keys())[list(tokenizer.vocab.values()).index(700)]
    >>> '</'
    
    list(tokenizer.vocab.keys())[list(tokenizer.vocab.values()).index(1867)]
    >>> 'ā–</'
    

    As for stopping the generation, I would investigate which token, or sequence of tokens, is usually produced at the end of a sequence, and use that as the stopping criterion (or possibly both variants; I am not familiar with the concrete implementation).
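
    As a minimal sketch of that idea (an assumption about how one might wire it up, not part of the original answer), a custom StoppingCriteria can halt generation once the generated ids end with either tokenization of the stop text. The stop sequences are built with the same trick as above: tokenize the text once after a space and once glued to a dummy word, then drop the BOS id and the dummy-word id:

    from transformers import StoppingCriteria, StoppingCriteriaList

    class StopOnTokenSequence(StoppingCriteria):
        """Stop when the generated ids end with any of the given id sequences."""
        def __init__(self, stop_sequences):
            self.stop_sequences = stop_sequences  # list of lists of token ids

        def __call__(self, input_ids, scores, **kwargs):
            # Assumes batch size 1, as in a simple generate() call.
            generated = input_ids[0].tolist()
            return any(
                len(generated) >= len(seq) and generated[-len(seq):] == seq
                for seq in self.stop_sequences
            )

    stop_text = "</|im_end|>"
    # Both tokenizations of the stop text: after a space and attached to a
    # preceding word ('test' is a single token here, so [2:] drops BOS + 'test').
    stop_sequences = [
        tokenizer("test " + stop_text)["input_ids"][2:],
        tokenizer("test" + stop_text)["input_ids"][2:],
    ]

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        stopping_criteria=StoppingCriteriaList([StopOnTokenSequence(stop_sequences)]),
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))

    Note that this only triggers if the model actually emits one of these exact id sequences; if the model tokenizes the end marker differently during generation, decoding the tail of the output and checking for the stop string is a more robust (if slower) alternative.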