So I trained a tokenizer from scratch using Hugging Face’s tokenizers library (not AutoTokenizer.from_pretrained, but actually trained a new one). It seemed to go fine, no errors. But when I try to use it during inference, it splits words in weird places. Even pretty common ones like “awesome” or “terrible” end up getting split into multiple subwords like aw, ##es, ##ome, etc.
I expected a fresh tokenizer to do better with those kinds of words since I saw them in the training data.
Here’s how I trained the tokenizer (simplified version):
from tokenizers import BertWordPieceTokenizer
files = ["data.txt"]  # contains one text per line
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=files, vocab_size=3000, min_frequency=2, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.save_model("my_tokenizer")  # writes vocab.txt into the my_tokenizer directory
And this is how I use it later:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("my_tokenizer")
text = "this movie was awesome and I loved the acting"
tokens = tokenizer.tokenize(text)
print(tokens)
Which gives me:
['this', 'movie', 'was', 'aw', '##es', '##ome', 'and', 'i', 'loved', 'the', 'acting']
So like... why is “awesome” getting split into 3 tokens? That word appears in the training file multiple times, definitely more than the min_frequency of 2. I even checked the vocab file, and I don’t see “awesome” as a full token in there.
I tried:
increasing vocab_size to 10k (same issue)
lowering min_frequency to 1
turning off lowercase
checking the vocab.txt, which still doesn’t have the full words I expect
Maybe I’m misunderstanding how the tokenizer learns or builds its vocab? Or is there something I’m doing wrong during training?
If needed I can share a dummy version of the data.txt file I used. It’s just a list of simple sentences like:
this movie was awesome
terrible film
acting was good
i loved it
Would appreciate any ideas, not sure if this is expected behavior or if I messed something up in how I’m training it.
Yeah, this comes up a lot when training a tokeniser from scratch. Just because a word shows up in your training data doesn’t mean it will end up in the vocab as a whole token; it depends on how the trainer builds the vocabulary.
WordPiece doesn’t add whole words directly. It starts from single characters and keeps adding merged subword pieces that give good coverage of the whole corpus, stopping once it hits vocab_size. A word like “awesome” only becomes a single token if there is still budget left for it after the pieces that cover more text, so with a budget of 3000 it is completely normal for it to stay split into aw, ##es, ##ome.
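If it helps to see why the split comes out exactly like that, here is a minimal sketch of the greedy longest-match-first lookup WordPiece does at tokenization time. The toy vocab below is made up; it just mirrors the pieces you reported:

# Greedy longest-match-first, the same idea BERT's WordPiece uses at inference time.
def wordpiece_split(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # no match: try a shorter slice
        if cur is None:
            return [unk]                  # nothing matches at all
        pieces.append(cur)
        start = end
    return pieces

toy_vocab = {"this", "movie", "was", "aw", "##es", "##ome"}
print(wordpiece_split("awesome", toy_vocab))  # ['aw', '##es', '##ome']

Since “awesome” itself is not in the vocab, the longest matches it can find are exactly aw, ##es, ##ome.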
If you want common words like that to stay intact, here are a few things you can try:
Increase vocab_size to something like 8000 or 10000. With 3000, you are going to see a lot of splits (there’s a quick side-by-side sketch after this list).
Lowering min_frequency might help, but only if the word is just barely making the cut.
Check the text file you're using to train. With lowercase=True and the BERT-style pre-tokenizer, casing and trailing punctuation get normalised away, but if you turn lowercasing off (one of the things you tried), “Awesome” and “awesome” are counted as separate words, which dilutes their frequency.
Also make sure it’s not just appearing two or three times in a sea of other data. That might not be enough for it to get included.
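For the vocab_size point, a quick side-by-side is easy to run. This is a rough sketch, with data.txt standing in for your actual file:

from tokenizers import BertWordPieceTokenizer

# retrain with two different budgets and compare how "awesome" comes out
for size in (3000, 8000):
    tok = BertWordPieceTokenizer(lowercase=True)
    tok.train(files=["data.txt"], vocab_size=size, min_frequency=2,
              special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
    print(size,
          "awesome" in tok.get_vocab(),
          tok.encode("awesome", add_special_tokens=False).tokens)

If the word shows up whole at 8000 but not at 3000, that confirms it is purely a budget issue rather than anything wrong with your training call.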
Another thing to be aware of: when you load the tokeniser with BertTokenizer.from_pretrained(), only vocab.txt comes from your training run. Anything it can’t find (tokenizer_config.json, special_tokens_map.json) falls back to BERT’s defaults, so settings like lowercasing and special tokens may not match what you actually trained. You could try PreTrainedTokenizerFast instead, especially since you trained the tokeniser with the tokenizers library directly (sketch below).
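If you go the PreTrainedTokenizerFast route, one way that works is to save the full tokenizer definition next to vocab.txt and load from that. This is a sketch, assuming the same my_tokenizer directory as your code:

from transformers import PreTrainedTokenizerFast

# at training time, in addition to save_model, also write the full definition:
#   tokenizer.save("my_tokenizer/tokenizer.json")
fast = PreTrainedTokenizerFast(
    tokenizer_file="my_tokenizer/tokenizer.json",
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
print(fast.tokenize("this movie was awesome and I loved the acting"))

That way the normaliser, pre-tokenizer, and special-token setup you trained with are all preserved in one file instead of being reconstructed from defaults.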
You can also just check vocab.txt and search for “awesome”. If it’s not in there as a full token, that would explain the split you are seeing.
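Something like this is enough for that check, pointing at the vocab.txt your save_model call wrote out:

# each line of vocab.txt is one token, so a set-membership check is all you need
with open("my_tokenizer/vocab.txt", encoding="utf-8") as f:
    vocab = set(line.strip() for line in f)

for word in ["awesome", "terrible", "aw", "##es", "##ome"]:
    print(word, word in vocab)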
Nothing looks broken in your code. This is just standard behaviour for how WordPiece handles vocab limits and slightly uncommon words. I’ve usually had better results with vocab sizes in the 8k to 16k range when I want to avoid unnecessary token splits.