I'm loading a HF tokenizer and want to stop generation on the sequence "</|im_end|>", but it looks like the tokenizer has two different ids for the same token. Is this a bug, or is it supposed to be this way?
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ")
model = AutoModelForCausalLM.from_pretrained("LoneStriker/AlphaMonarch-7B-AWQ", device_map='cuda')

tokenizer.decode(700)   # '</'
tokenizer.decode(1867)  # '</'
tokenizer.decode(700) == tokenizer.decode(1867)  # True
The tokenization depends on whether the given token starts a new word (i.e. is preceded by whitespace) or continues a previous one. Note the difference:
tokenizer('test </')
>>> {'input_ids': [1, 1369, 1867], 'attention_mask': [1, 1, 1]}
tokenizer('test</')
>>> {'input_ids': [1, 1369, 700], 'attention_mask': [1, 1, 1]}
This is actually not unique to special characters: even common tokens have two tokenizations depending on where in the word they appear, e.g. the token power:
tokenizer('power')
>>> {'input_ids': [1, 1982], 'attention_mask': [1, 1]}
tokenizer('superpower')
>>> {'input_ids': [1, 2195, 6468], 'attention_mask': [1, 1, 1]}
Some tokenizers mark tokens that can only appear as the continuation of a word with an explicit prefix, e.g.:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert_tokenizer.decode(1540)
>>> 'power'
bert_tokenizer.decode(9447)
>>> '##power'
This is true even for the tokenizer in question if you investigate the tokenizer.vocab object: here the prefix ▁ (the SentencePiece "metaspace") marks a token that appears at the beginning of a word. The distinction does not survive tokenizer.decode because decoding converts ▁ back to a regular space and strips it at the start of the output, so both ids come out as the same string:
list(tokenizer.vocab.keys())[list(tokenizer.vocab.values()).index(700)]
>>> '</'
list(tokenizer.vocab.keys())[list(tokenizer.vocab.values()).index(1867)]
>>> '▁</'
As for stopping the generation, I would check which token (or sequence of tokens) the model actually produces at the end of its output and use that as the stopping criterion; to be safe, you can register both tokenizations of your stop string (I am not familiar with the concrete implementation you are using).
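A minimal, framework-agnostic sketch of that idea (the token ids below are illustrative placeholders, not the real ids of "</|im_end|>"; obtain the real ones by tokenizing the stop string both with and without a leading space — in transformers you would wrap this check in a StoppingCriteria subclass and pass it to model.generate via stopping_criteria):

```python
# Sketch: stop generation when the tail of the generated ids matches ANY
# registered tokenization of the stop string. Since "</" can be id 700
# (mid-word) or 1867 (word-initial), we register both variants.

def ends_with_any(generated_ids, stop_sequences):
    """True if generated_ids ends with any of the registered stop sequences."""
    return any(
        len(stop) > 0 and generated_ids[-len(stop):] == stop
        for stop in stop_sequences
    )

# Hypothetical tail ids after the leading "</" token (placeholders):
stop_sequences = [
    [700, 42, 43],   # "</..." continuing a word
    [1867, 42, 43],  # "</..." starting a word
]

ends_with_any([5, 6, 1867, 42, 43], stop_sequences)  # True
ends_with_any([5, 6, 7], stop_sequences)             # False
```

The same check works regardless of which tokenization the model happens to emit, which is exactly why registering both id sequences is safer than picking one.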