I used the following code to load my custom-trained tokenizer:

```python
from transformers import AutoTokenizer

test_tokenizer = AutoTokenizer.from_pretrained('raptorkwok/cantonese-tokenizer-test')
```
It took forever to load. Even if I replace `AutoTokenizer` with `PreTrainedTokenizerFast`, it still loads forever. How can I debug or fix this issue?
The problem is resolved by downgrading `transformers` from version 4.41.0 to 4.28.1. With the older version, both `pipeline()` and `from_pretrained()` load the tokenizer successfully in seconds.
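For reference, a minimal sketch of the downgrade, assuming a pip-managed environment (adjust if you use conda or another package manager):

```shell
# Pin transformers to 4.28.1, the version where the tokenizer loads in seconds
pip install transformers==4.28.1

# Verify the installed version
python -c "import transformers; print(transformers.__version__)"
```

Pinning the exact version in your `requirements.txt` (e.g. `transformers==4.28.1`) keeps the environment from silently upgrading back to the slow version.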