Given a Huggingface tokenizer that already has a normalizer, e.g. "mistralai/Mistral-7B-v0.1", we can modify the normalizer like this:
import json
from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend
tokenizer_name = "mistralai/Mistral-7B-v0.1"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
assert old_tok.backend_tokenizer.normalizer is not None
new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)
old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)
[out]:
>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world
>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s> I bar you
hello world
But this hot-plug normalizer modification doesn't always work. If we change the tokenizer to "mistralai/Mistral-7B-v0.3", it fails:
import json
from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend
tokenizer_name = "mistralai/Mistral-7B-v0.3"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)
old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)
print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
[out]:
<s> I foo you < br > hello world
<s> I foo you < br > hello world
Can any normalizer from a pretrained tokenizer be modified or just specific ones?
If the latter, why and how do we know if a pretrained tokenizer's normalizer can be extended or modified?
This looks like a bug. The v0.1 tokenizer has a normalizer by default, which can be seen by looking at the mistral-7B-v0.1/tokenizer.json file:
{
  ...
  "normalizer": {
    "type": "Sequence",
    "normalizers": [
      {
        "type": "Prepend",
        "prepend": "▁"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": " "
        },
        "content": "▁"
      }
    ]
  },
  ...
}
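You can also check this programmatically without opening the file; a minimal sketch (the exact repr printed for the Sequence depends on your tokenizers version):
from transformers import AutoTokenizer

for name in ["mistralai/Mistral-7B-v0.1", "mistralai/Mistral-7B-v0.3"]:
    tok = AutoTokenizer.from_pretrained(name)
    # v0.1 shows a Sequence normalizer; v0.3 prints None (matching "normalizer": null)
    print(name, "->", tok.backend_tokenizer.normalizer)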
After modifying the .backend_tokenizer.normalizer object, the modifications are saved to the tokenizer.json file.
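You can confirm that by reading the saved file back; a minimal sketch, assuming the save directory produced by the question's v0.1 snippet:
import json

# directory written by save_pretrained() in the question's v0.1 example
with open("new_tokenizer-mistralai/Mistral-7B-v0.1/tokenizer.json") as fp:
    saved = json.load(fp)
# shows the custom Sequence, including the extra Replace steps
print(json.dumps(saved["normalizer"], indent=2))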
In the v0.3 version, the mistral-7B-v0.3/tokenizer.json file has no value for the normalizer:
{
  ...
  "normalizer": null,
  ...
}
Modifying the normalizer and saving the model does write the change to the JSON file, but it is not picked up on reload via AutoTokenizer.from_pretrained. I am not sure why, but it is entirely possible that the tokenizer.model file indicates no normalizer is the default and the custom one simply does not get loaded.
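A minimal sketch of that behaviour, assuming the save directory written by the question's v0.3 snippet:
import json
from transformers import AutoTokenizer

save_dir = "new_tokenizer-mistralai/Mistral-7B-v0.3"  # written by the question's v0.3 snippet
# The saved JSON does contain the custom normalizer entry ...
with open(f"{save_dir}/tokenizer.json") as fp:
    print(json.load(fp)["normalizer"] is not None)  # True
# ... yet the reloaded tokenizer behaves as if it were not there.
reloaded = AutoTokenizer.from_pretrained(save_dir)
print(' '.join(reloaded.batch_decode(reloaded("I foo you<br>hello world")['input_ids'])))
# <s> I foo you < br > hello world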
However, you can get the tokenizer to load correctly - with the custom normalizer - by instantiating the matching tokenizer class explicitly and passing in the tokenizer.model and tokenizer.json paths along with the values from the tokenizer_config.json file. In this case that class is LlamaTokenizerFast.
import json
from transformers import AutoTokenizer, LlamaTokenizerFast, AddedToken
from tokenizers.normalizers import Sequence, Replace, Prepend

### load, modify, and save
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tok.backend_tokenizer.normalizer = Sequence([
    Prepend('▁'),
    Replace('▁', ' '),
    Replace("foo", "bar"),
    Replace('<br>', '\n')
])
tok.save_pretrained("mistral-7B-v0.3-custom/")
### read in config, construct the AddedToken objects
with open('mistral-7B-v0.3-custom/tokenizer_config.json') as fp:
    config = json.load(fp)
config['added_tokens_decoder'] = {
    int(k): AddedToken(**v)
    for k, v in config.pop('added_tokens_decoder').items()
}
### load from saved files
tok_custom = LlamaTokenizerFast(
    'mistral-7B-v0.3-custom/tokenizer.model',
    'mistral-7B-v0.3-custom/tokenizer.json',
    **config,
)
test_str = "I foo you<br>hello world"
print(' '.join(tok_custom.batch_decode(tok_custom(test_str)['input_ids'])))
# prints:
#<s> I bar you
# hello world
If you don't want to specify the tokenizer class explicitly, you can load the model with AutoTokenizer first and then load it again using the resulting class. It is a hacky work-around.
tok_path = "path/to/mistral-7B-v0.3-custom/"
with open(f'{tok_path}/tokenizer_config.json') as fp:
    config = json.load(fp)
config['added_tokens_decoder'] = {
    int(k): AddedToken(**v)
    for k, v in config.pop('added_tokens_decoder').items()
}
tok = AutoTokenizer.from_pretrained(tok_path).__class__(
    f'{tok_path}/tokenizer.model',
    f'{tok_path}/tokenizer.json',
    **config,
)
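As a quick sanity check (assuming the snippet above has run), this tokenizer should produce the same output as the explicit LlamaTokenizerFast version:
print(' '.join(tok.batch_decode(tok("I foo you<br>hello world")['input_ids'])))
# prints:
#<s> I bar you
# hello world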