python, huggingface-transformers

BertTokenizer.from_pretrained raises UnicodeDecodeError


I pre-trained a model and got a pytorch_model.bin file from a pre-training script. But when I try to load it with the following code, it raises a UnicodeDecodeError:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("/path/to/pytorch_model.bin")  # raises UnicodeDecodeError

The traceback is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/models/bert/tokenization_bert.py", line 218, in __init__
    self.vocab = load_vocab(vocab_file)
  File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/transformers/models/bert/tokenization_bert.py", line 121, in load_vocab
    tokens = reader.readlines()
  File "/opt/tljh/user/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

How can I resolve this issue?

Versions:


Solution

  • from_pretrained takes the path to a directory containing files saved with save_pretrained(), not the path to the pytorch_model.bin file itself. For BertTokenizer, that directory must contain the tokenizer files (e.g. vocab.txt); pointing it at the binary weights file makes load_vocab try to decode the .bin as UTF-8 text, which is exactly the UnicodeDecodeError in your traceback.

    You can save your model and tokenizer to a directory:

    model.save_pretrained("my_model_directory")
    tokenizer.save_pretrained("my_model_directory")
    

    Then you can load the tokenizer from that directory:

    BertTokenizer.from_pretrained("my_model_directory")
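
    For example, here is a minimal end-to-end sketch; the directory name and the base checkpoint are placeholders, assuming a standard BERT setup:

    from transformers import BertTokenizer, BertForMaskedLM

    # Save everything from_pretrained needs into one directory.
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    model.save_pretrained("my_model_directory")      # writes the model weights and config.json
    tokenizer.save_pretrained("my_model_directory")  # writes vocab.txt and the tokenizer config

    # Later, load both back from the directory (not from the .bin file).
    tokenizer = BertTokenizer.from_pretrained("my_model_directory")
    model = BertForMaskedLM.from_pretrained("my_model_directory")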