python, huggingface-transformers, tokenize, large-language-model, llama

Removing strange/special characters from Llama 3.1 model outputs


Background: I'm using Hugging Face's transformers package and Llama 3.1 8B (Instruct).

Problem: I am generating responses to a prompt one token at a time in the following way (note that I pick one of the texts candidates, append it to input_string, then repeat the process):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
model = AutoModelForCausalLM.from_pretrained(model_path, use_safetensors=True)

input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
logits = model(input_ids).logits                                # forward pass to get logits
logits = logits[-1, -1]                                         # logits for the last position of the last batch item
probs = torch.nn.functional.softmax(logits, dim=-1)             # softmax to get probabilities
probs, ids = torch.topk(probs, 5)                               # keep only the top 5 candidates
texts = tokenizer.convert_ids_to_tokens(ids)                    # convert ids to token strings

But I notice many strange or special characters appearing in the output. For example, the following is the literal string generated for input_string = "How often should I wear a seatbelt?":

ĠAlways.ĊĊĊÄÃĦAlways,ĠunlessĠyouĠareÄłinÃĦaÃĦcarÃĥthatÃĦisÃĦnotÃĦmoving.

Is there any way to easily remove strange special characters?

I've tried the options on the decoder (in every possible True/False combination), such as the following:

myStr = 'ĠAlways.ĊĊĊÄÃĦAlways,ĠunlessĠyouĠareÄłinÃĦaÃĦcarÃĥthatÃĦisÃĦnotÃĦmoving.'
tokenizer.decode(tokenizer.encode(myStr), skip_special_tokens=True, clean_up_tokenization_spaces=True)

But it doesn't remove any of the special characters from the string.


Solution

  • TL;DR

    Use this instead of rolling your own detokenizer.

    tokenizer.batch_decode(input_ids)
    

    In Long

    The official Llama 3.1 checkpoints are gated behind an approval process that can take some time, so this answer will use a proxy model that shares the same tokenizer as Llama 3.1.

    Without loading the model or running a forward pass, we can make those "odd symbols" appear directly by converting the text into input IDs and then converting the IDs back to tokens.

    You'll see that a Ġ symbol is consistently prepended to words.

    from transformers import AutoTokenizer
    import torch
    
    model_path = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8"
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
    
    input_string = "Always. Always, unless you are in a car that is not moving"
    input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
    texts = tokenizer.convert_ids_to_tokens(input_ids.squeeze()) # convert ids to tokens
    
    print(texts)
    

    [out]:

    ['<|begin_of_text|>',
     'Always',
     '.',
     'ĠAlways',
     ',',
     'Ġunless',
     'Ġyou',
     'Ġare',
     'Ġin',
     'Ġa',
     'Ġcar',
     'Ġthat',
     'Ġis',
     'Ġnot',
     'Ġmoving']
    

    It seems Ġ denotes a leading space, much like SentencePiece uses the "▁" (U+2581) symbol.
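
    To confirm, convert_tokens_to_string reverses the byte-level mapping, turning each Ġ back into a space (continuing with the tokenizer loaded above):

    print(repr(tokenizer.convert_tokens_to_string(['ĠAlways', 'Ġunless'])))


    [out]:

    ' Always unless'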

    So where does that Ġ come from?

    Let's first try printing out the vocab; you'll see these non-natural-text characters everywhere:

    print(tokenizer.vocab)
    

    [out]:

    {'icc': 48738,
     'ĠCarly': 79191,
     'ĠBOT': 83430,
     'ĠÑĦоÑĤо': 118849,
     'depends': 59047,
     'ĠÑĢиз': 120010,
     'ĠDolphin': 96096,
     'ĠdataType': 23082,
     'ĠÙģÙĤد': 116811,
     'Ġme': 757,
     'ÙĦÙī': 84659,
     '.secondary': 70156,
     'ĠAxes': 90804,
     'PN': 18378,
     'Ġflav': 18779,
     'Ġhp': 21280,
     '(Module': 76395,
     'ãģ¾ãģ§': 103296,
     ...}
    

    Stop telling me the obvious; just let me know where those ĠÑĢÐÙģÙĤ characters come from...

    See https://github.com/openai/gpt-2/issues/80 and https://augustasmacijauskas.github.io/personal-website/posts/tokenizers-deep-dive/tokenizers-deep-dive.html

    The root of this Ġevil comes from https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9
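
    For the curious, here is a condensed copy of the bytes_to_unicode mapping from that linked encoder.py. Every byte gets a visible stand-in character; bytes that are not already printable are remapped to code points above 256 (for the low bytes, simply byte + 256), which is exactly how a space (0x20) becomes Ġ and a newline (0x0A) becomes Ċ:

    # Condensed from gpt-2/src/encoder.py (linked above).
    def bytes_to_unicode():
        # printable bytes keep their own character...
        bs = (list(range(ord("!"), ord("~") + 1))
              + list(range(ord("¡"), ord("¬") + 1))
              + list(range(ord("®"), ord("ÿ") + 1)))
        cs = bs[:]
        n = 0
        # ...all other bytes (space, newline, control bytes) map to 256 + n
        for b in range(2 ** 8):
            if b not in bs:
                bs.append(b)
                cs.append(2 ** 8 + n)
                n += 1
        return dict(zip(bs, (chr(c) for c in cs)))

    byte2char = bytes_to_unicode()
    print(byte2char[ord(" ")], byte2char[ord("\n")])


    [out]:

    Ġ Ċ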

    So how do I get the decoded tokens in natural text?

    Try this:

    from transformers import AutoTokenizer

    model_path = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8"
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)

    input_string = "Always. Always, unless you are in a car that is not moving"
    input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids

    tokenizer.batch_decode(input_ids) # convert ids back to natural text
    

    [out]:

    ['<|begin_of_text|>Always. Always, unless you are in a car that is not moving']
    

    And to remove the special BOS token:

    tokenizer.batch_decode(input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    

    [out]:

    ['Always. Always, unless you are in a car that is not moving']
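
    Back to the question's snippet: the mangled characters appear because convert_ids_to_tokens returns the raw byte-level token strings, and re-encoding that already-mangled string (as in the decode attempt in the question) can't undo it, since by then the Ġ/Ċ characters are literal text. Decode the candidate IDs instead. A minimal sketch of the top-5 step from the question, with model_path and input_string as defined there:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, use_safetensors=True)

    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    logits = model(input_ids).logits[-1, -1]                   # logits for the last position
    probs, ids = torch.topk(torch.softmax(logits, dim=-1), 5)  # top-5 candidate ids
    texts = [tokenizer.decode([i]) for i in ids.tolist()]      # e.g. ' Always' instead of 'ĠAlways'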