python, nlp, large-language-model, huggingface-tokenizers

How do we add/modify the normalizer in a pretrained Huggingface tokenizer?


Given a Huggingface tokenizer that already has a normalizer, e.g. "mistralai/Mistral-7B-v0.1", we can modify the normalizer like this:

import json

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tokenizer_name = "mistralai/Mistral-7B-v0.1"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)

assert old_tok.backend_tokenizer.normalizer is not None

new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)


old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

[out]:

>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world

>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s>  I  bar  you 
 hello  world

But this hot-plug normalizer modification doesn't always work. If we change the model to "mistralai/Mistral-7B-v0.3", it fails:

import json

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tokenizer_name = "mistralai/Mistral-7B-v0.3"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)

new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)


old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))

[out]:

<s> I foo you < br > hello world
<s> I foo you < br > hello world

How do we add/modify the normalizer in a pretrained Huggingface tokenizer?

Can any normalizer from a pretrained tokenizer be modified or just specific ones?

If the latter, why and how do we know if a pretrained tokenizer's normalizer can be extended or modified?


Solution

  • This looks like a bug. The v0.1 tokenizer has a normalizer by default, which can be seen by looking at the mistral-7B-v0.1/tokenizer.json file:

    {
     ...
      "normalizer": {
        "type": "Sequence",
        "normalizers": [
          {
            "type": "Prepend",
            "prepend": "▁"
          },
          {
            "type": "Replace",
            "pattern": {
              "String": " "
            },
            "content": "▁"
          }
        ]
      },
    ...
    }
    

    After modifying the .backend_tokenizer.normalizer object and calling save_pretrained, the modifications are written to the tokenizer.json file.
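
    A quick way to confirm this is to exercise the normalizer directly and peek at the saved file. This is only a small sketch; the directory name assumes the save path from the question's v0.1 snippet, and normalize_str applies the normalizer to a raw string.

    import json

    from transformers import AutoTokenizer

    # Sketch: check the modified normalizer, both in memory and on disk.
    # The directory below is the save path used in the question's v0.1 snippet.
    saved_dir = "new_tokenizer-mistralai/Mistral-7B-v0.1"
    new_tok = AutoTokenizer.from_pretrained(saved_dir)

    # Apply the normalizer to a raw string:
    print(new_tok.backend_tokenizer.normalizer.normalize_str("I foo you<br>hello world"))
    # expected: " I bar you\nhello world"

    # The same rules are serialized under the "normalizer" key of tokenizer.json:
    with open(f"{saved_dir}/tokenizer.json") as fp:
        print(json.dumps(json.load(fp)["normalizer"], indent=2))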

    In the v0.3 version, the mistral-7B-v0.3/tokenizer.json file has no value for the normalizer:

    {
    ...
      "normalizer": null,
    ...
    }
    

    Modifying the normalizer and saving the model does write the changes to the JSON file, but they are not picked up on reload with AutoTokenizer.from_pretrained. I am not sure why, but it is entirely possible that the tokenizer.model file indicates no normalizer by default, so the normalizer in tokenizer.json simply is not loaded.
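
    To confirm the saved JSON itself is fine, you can bypass transformers and deserialize the file with the low-level tokenizers library. This is only a sketch, with the path assumed to be wherever the modified v0.3 tokenizer was saved; since Tokenizer.from_file reads the full tokenizer.json, it should pick up the custom normalizer, which points at the AutoTokenizer reload path as the culprit.

    from tokenizers import Tokenizer

    # Sketch: deserialize the saved tokenizer.json directly.
    # Adjust the path to wherever the modified v0.3 tokenizer was saved.
    raw_tok = Tokenizer.from_file("new_tokenizer-mistralai/Mistral-7B-v0.3/tokenizer.json")

    print(raw_tok.normalizer)  # the custom Sequence normalizer is present here
    print(raw_tok.normalizer.normalize_str("I foo you<br>hello world"))
    # expected: " I bar you\nhello world"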

    However, you can get the tokenizer to load correctly - with the custom normalizer - by instantiating the matching tokenizer class explicitly and passing in the tokenizer.model and tokenizer.json paths along with the values from the tokenizer_config.json file. In this case it is the LlamaTokenizerFast class.

    import json

    from transformers import AutoTokenizer, LlamaTokenizerFast, AddedToken
    from tokenizers.normalizers import Sequence, Replace, Prepend
    
    ### load, modify, and save
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
    tok.backend_tokenizer.normalizer = Sequence([
        Prepend('▁'),
        Replace('▁', ' '),
        Replace("foo", "bar"),
        Replace('<br>', '\n')
    ])
    tok.save_pretrained("mistral-7B-v0.3-custom/")
    
    
    ### read in config, construct the AddedToken objects
    with open('mistral-7B-v0.3-custom/tokenizer_config.json') as fp:
        config = json.load(fp)
        config['added_tokens_decoder'] = {
            int(k): AddedToken(**v)
            for k, v in config.pop('added_tokens_decoder').items()
        }
    
    ### load from saved files
    tok_custom = LlamaTokenizerFast(
        'mistral-7B-v0.3-custom/tokenizer.model', 
        'mistral-7B-v0.3-custom/tokenizer.json', 
        **config,
    )
    
    test_str = "I foo you<br>hello world"
    print(' '.join(tok_custom.batch_decode(tok_custom(test_str)['input_ids'])))
    # prints:
    #<s> I bar you
    # hello world
    

    If you don't want to specify the tokenizer class explicitly, you can load the model with AutoTokenizer and then load it again using the resulting class. It is a hacky workaround.

    tok_path = "path/to/mistral-7B-v0.3-custom/"
    with open(f'{tok_path}/tokenizer_config.json') as fp:
        config = json.load(fp)
        config['added_tokens_decoder'] = {
            int(k): AddedToken(**v)
            for k, v in config.pop('added_tokens_decoder').items()
        }
    
    tok = AutoTokenizer.from_pretrained(tok_path).__class__(
        f'{tok_path}/tokenizer.model', 
        f'{tok_path}/tokenizer.json', 
        **config,
    )