huggingface-transformers huggingface-tokenizers huggingface-trainer

PanicException: AddedVocabulary bad split AFTER adding tokens to BertTokenizer


I use a BertTokenizer and add my custom tokens with its add_tokens() method.

Minimal sample code here:

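from transformers import (
    BartForConditionalGeneration,
    BertTokenizer,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = 'fnlp/bart-base-chinese'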
tokenizer = BertTokenizer.from_pretrained(checkpoint)
tokenizer.add_tokens(["Token1", "Token2"]) # just some samples, I added a million tokens
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions = True, output_hidden_states = True)

training_args = Seq2SeqTrainingArguments(
    output_dir = output_model,
    evaluation_strategy = "epoch",
    optim = "adamw_torch", 
    eval_steps = 1000,
    save_strategy = "epoch",
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 1,
    num_train_epochs = 30, 
    predict_with_generate=True,
    remove_unused_columns=True,
    fp16 = True,
    metric_for_best_model = "bleu",
    load_best_model_at_end = True,
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train_data,
    eval_dataset = eval_data, 
    tokenizer = tokenizer, # I use the tokenizer with added tokens here
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()
trainer.push_to_hub(output_model, private=True)

Training completed without a problem. But when I use the new model in a pipeline, it often raises the exception PanicException: AddedVocabulary bad split. Here is the pipeline code:

text = "Words to translate"

from transformers import pipeline, BertTokenizer

hf_model_name = "my_huggingface_username/" + output_model

translator = pipeline("translation", model=hf_model_name, max_length=200)
print(translator(text)[0]['translation_text'].replace(' ', ''))

I cannot find a pattern in when the exception occurs, or what causes it. How can I resolve this PanicException?


Solution

  • The PanicException goes away when the pipeline is changed from:

    translator = pipeline("translation", model=hf_model_name, max_length=200)
    print(translator(text)[0]['translation_text'].replace(' ', ''))
    

    to:

    custom_tokenizer = BertTokenizer.from_pretrained(hf_model_name)
    translator = pipeline("translation", model=hf_model_name, tokenizer=custom_tokenizer, max_length=200)
    print(translator(text)[0]['translation_text'].replace(' ', ''))
    

    When no tokenizer is passed, the pipeline function loads one with AutoTokenizer instead of BertTokenizer, and the tokenizer it resolves to is what raises the PanicException.

    From the docstring of the pipeline function's tokenizer argument:

    If not provided, the default tokenizer for the given model will be loaded (if it is a string). If model is not specified or not a string, then the default tokenizer for config is loaded (if it is a string). However, if config is also not given or not a string, then the default tokenizer for the given task will be loaded.

    In the actual code, the default tokenizer is loaded through AutoTokenizer, which caused the problem:

    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
    )
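To check which tokenizer a pipeline actually ended up with, something like the sketch below may help. The helper names describe_tokenizer and build_translator are made up for illustration and are not part of transformers. The sketch assumes the tokenizer object exposes is_fast and supports len(), as transformers tokenizers do, and that the model repository can be loaded with BertTokenizer.

```python
def describe_tokenizer(tokenizer):
    """Summarize which tokenizer implementation an object is (illustrative helper)."""
    return {
        "class": type(tokenizer).__name__,
        # is_fast is True for the Rust-backed "fast" tokenizers.
        "is_fast": bool(getattr(tokenizer, "is_fast", False)),
        # len() counts the base vocabulary plus any added tokens.
        "size": len(tokenizer),
    }


def build_translator(model_name, max_length=200):
    """Build the pipeline with an explicitly loaded BertTokenizer (illustrative helper)."""
    # Imported lazily so describe_tokenizer can be used without transformers installed.
    from transformers import BertTokenizer, pipeline

    tokenizer = BertTokenizer.from_pretrained(model_name)
    return pipeline(
        "translation",
        model=model_name,
        tokenizer=tokenizer,  # bypasses the AutoTokenizer default
        max_length=max_length,
    )


if __name__ == "__main__":
    # Placeholder repository name; substitute your own model.
    translator = build_translator("my_huggingface_username/my_model")
    print(describe_tokenizer(translator.tokenizer))
```

If the summary for the default pipeline shows is_fast as True while the explicit BertTokenizer shows False, that would be consistent with the exception coming from the fast tokenizer. Passing use_fast=False to pipeline might have a similar effect, but that was not tested here.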