I use a BertTokenizer and add my custom tokens with the add_tokens() function. Minimal sample code:
from transformers import (BertTokenizer, BartForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          EarlyStoppingCallback)

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
tokenizer.add_tokens(["Token1", "Token2"])  # just some samples, I added a million tokens
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the added tokens
training_args = Seq2SeqTrainingArguments(
    output_dir=output_model,
    evaluation_strategy="epoch",
    optim="adamw_torch",
    eval_steps=1000,
    save_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=30,
    predict_with_generate=True,
    remove_unused_columns=True,
    fp16=True,
    metric_for_best_model="bleu",
    load_best_model_at_end=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,  # I use the tokenizer with the added tokens here
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.push_to_hub(output_model, private=True)
The training process completed without a problem. But when I use the new model in a pipeline, there is a high chance that this exception occurs: PanicException: AddedVocabulary bad split. Here is the pipeline code:
text = "Words to translate"
from transformers import pipeline, BertTokenizer
hf_model_name = "my_huggingface_username/" + output_model
translator = pipeline("translation", model=hf_model_name, max_length=200)
print(translator(text)[0]['translation_text'].replace(' ', ''))
I cannot find a pattern or cause for why the exception happens. How can I resolve this PanicException problem?
The PanicException is resolved by changing the pipeline from:
translator = pipeline("translation", model=hf_model_name, max_length=200)
print(translator(text)[0]['translation_text'].replace(' ', ''))
to:
custom_tokenizer = BertTokenizer.from_pretrained(hf_model_name)
translator = pipeline("translation", model=hf_model_name, tokenizer=custom_tokenizer, max_length=200)
print(translator(text)[0]['translation_text'].replace(' ', ''))
The pipeline function uses AutoTokenizer instead of BertTokenizer, which leads to the PanicException.
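As a quick check (a minimal sketch; the repo name below is a hypothetical placeholder), you can compare which tokenizer class each loader resolves to. AutoTokenizer defaults to the fast, Rust-backed tokenizer when one is available, and the panic originates in that Rust code path:

from transformers import AutoTokenizer, BertTokenizer

hf_model_name = "my_huggingface_username/output_model"  # hypothetical repo name

# AutoTokenizer defaults to the fast (Rust-backed) tokenizer when available;
# the PanicException is raised from that Rust code path.
auto_tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
print(type(auto_tokenizer).__name__)  # likely BertTokenizerFast

# BertTokenizer is the slow, pure-Python implementation, which avoids the panic.
slow_tokenizer = BertTokenizer.from_pretrained(hf_model_name)
print(type(slow_tokenizer).__name__)  # BertTokenizer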
From the source code:
If not provided, the default tokenizer for the given model will be loaded (if it is a string). If model is not specified or not a string, then the default tokenizer for config is loaded (if it is a string). However, if config is also not given or not a string, then the default tokenizer for the given task will be loaded.
From the actual code, it uses AutoTokenizer, which caused the problem:
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
)
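Alternatively, since AutoTokenizer.from_pretrained accepts a use_fast flag, you should be able to force the slow tokenizer and hand it to the pipeline explicitly. This is a sketch of an equivalent workaround, not part of the original fix:

from transformers import AutoTokenizer, pipeline

# Force the slow (pure-Python) tokenizer so the Rust fast-tokenizer code path
# that panics is never used.
slow_tokenizer = AutoTokenizer.from_pretrained(hf_model_name, use_fast=False)
translator = pipeline(
    "translation",
    model=hf_model_name,
    tokenizer=slow_tokenizer,
    max_length=200,
)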