Background: I'm using Hugging Face's transformers
package and Llama 3.1 8B (Instruct).
Problem: I am generating responses to a prompt one word at a time in the following way (note that I pick one of the candidates in texts, append it to input_string, and then repeat the process):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
model = AutoModelForCausalLM.from_pretrained(model_path, use_safetensors=True)
input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
logits = model(input_ids).logits # call model() to get logits
logits = logits[-1, -1] # only care about the last position of the last batch item
probs = torch.nn.functional.softmax(logits, dim=-1) # softmax() to get probabilities
probs, ids = torch.topk(probs, 5) # keep only the top 5
texts = tokenizer.convert_ids_to_tokens(ids) # convert ids to tokens
But I notice many strange or special characters appearing in the output. For example, the following is the literal string returned for input_string = "How often should I wear a seatbelt?":
ĠAlways.ĊĊĊÄÃĦAlways,ĠunlessĠyouĠareÄłinÃĦaÃĦcarÃĥthatÃĦisÃĦnotÃĦmoving.
Is there any way to easily remove strange special characters?
I've tried using options on the decoder (in every possible T/F combo), such as the following:
myStr = 'ĠAlways.ĊĊĊÄÃĦAlways,ĠunlessĠyouĠareÄłinÃĦaÃĦcarÃĥthatÃĦisÃĦnotÃĦmoving.'
tokenizer.decode(tokenizer.encode(myStr), skip_special_tokens=True, clean_up_tokenization_spaces=True)
But it doesn't remove any of the special characters from the string.
Use this instead of rolling your own detokenizer:
tokenizer.batch_decode(input_ids)
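Concretely, the fix inside your loop is to decode the candidate ids back to natural text rather than calling convert_ids_to_tokens, which returns the raw byte-level BPE pieces. A minimal sketch, reusing the probs/ids variables from the snippet in the question:

probs, ids = torch.topk(probs, 5)                 # keep only the top 5
candidates = [tokenizer.decode(i) for i in ids]   # natural text, e.g. ' Always' instead of 'ĠAlways'

tokenizer.decode accepts a single id, so each candidate comes back as a plain string with real spaces and newlines.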
The official Llama 3.1 checkpoints are gated behind an approval process that can take some time, so this answer uses a proxy model that shares the same tokenizer as Llama 3.1.
Without loading the model or running a forward pass, we can see those "odd symbols" directly by converting the text into input IDs and then converting the IDs back to tokens. You'll see that a Ġ symbol is consistently prepended:
from transformers import AutoTokenizer
import torch
model_path = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
input_string = "Always. Always, unless you are in a car that is not moving"
input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
texts = tokenizer.convert_ids_to_tokens(input_ids.squeeze()) # convert ids to tokens
print(texts)
[out]:
['<|begin_of_text|>',
'Always',
'.',
'ĠAlways',
',',
'Ġunless',
'Ġyou',
'Ġare',
'Ġin',
'Ġa',
'Ġcar',
'Ġthat',
'Ġis',
'Ġnot',
'Ġmoving']
It seems like Ġ denotes a leading space, much like how sentencepiece uses the "▁" (U+2581) symbol.
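For comparison, here is a quick sketch with a sentencepiece-based tokenizer (t5-small is just an illustrative choice):

from transformers import AutoTokenizer

sp_tokenizer = AutoTokenizer.from_pretrained("t5-small")  # a sentencepiece-based tokenizer
sp_ids = sp_tokenizer.encode("Always, unless you are in a car that is not moving")
print(sp_tokenizer.convert_ids_to_tokens(sp_ids))  # word-initial pieces are prefixed with '▁' instead of 'Ġ'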
But where does the Ġ come from? Let's first try printing out the vocab, and you'll see these non-natural text characters appearing everywhere:
print(tokenizer.vocab)
[out]:
{'icc': 48738,
'ĠCarly': 79191,
'ĠBOT': 83430,
'ĠÑĦоÑĤо': 118849,
'depends': 59047,
'ĠÑĢиз': 120010,
'ĠDolphin': 96096,
'ĠdataType': 23082,
'ĠÙģÙĤد': 116811,
'Ġme': 757,
'ÙĦÙī': 84659,
'.secondary': 70156,
'ĠAxes': 90804,
'PN': 18378,
'Ġflav': 18779,
'Ġhp': 21280,
'(Module': 76395,
'ãģ¾ãģ§': 103296,
...}
Where do these ĠÑĢÐÙģÙĤ-like characters come from? See https://github.com/openai/gpt-2/issues/80 and https://augustasmacijauskas.github.io/personal-website/posts/tokenizers-deep-dive/tokenizers-deep-dive.html
The root of this Ġevil comes from https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9
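That byte-level BPE first maps every raw byte to a printable unicode character; bytes that are not already printable get shifted up by 256, which is why a space (0x20) shows up as Ġ (U+0120) and a newline (0x0A) as Ċ (U+010A). A minimal sketch of that mapping, adapted from the linked encoder.py:

def bytes_to_unicode():
    # printable bytes keep their own codepoint
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # everything else is shifted into a printable range
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])   # 'Ġ'  (0x20 -> U+0120)
print(byte_encoder[ord("\n")])  # 'Ċ'  (0x0A -> U+010A)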
Try this:
from transformers import AutoTokenizer
import torch
model_path = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
input_string = "Always. Always, unless you are in a car that is not moving"
input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
texts = tokenizer.convert_ids_to_tokens(input_ids.squeeze())
tokenizer.batch_decode(input_ids) # convert ids to natural text.
[out]:
['<|begin_of_text|>Always. Always, unless you are in a car that is not moving']
And to remove the special BOS token,
tokenizer.batch_decode(input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
[out]:
['Always. Always, unless you are in a car that is not moving']
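As for the decode(encode(myStr)) attempt in the question: re-encoding the already-converted token string treats the Ġ/Ċ characters as literal text, so decoding just gives them back. If you already have the token pieces (e.g. texts from the topk step), convert_tokens_to_string maps them back to natural text; a small sketch reusing the question's variables:

# `ids` are the top-5 candidate ids from the question's loop
texts = tokenizer.convert_ids_to_tokens(ids)                      # raw pieces such as 'ĠAlways'
print([tokenizer.convert_tokens_to_string([t]) for t in texts])   # natural text such as ' Always'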