machine-translation

Choosing the vocabulary size of a tokenizer


I have a dataset of about 150,000 sentence pairs for a machine translation task. I need to build a tokenizer from the datasets of both the source and target languages.

What vocabulary size should I choose for the tokenizer? Thank you.


Solution

  • The optimal vocabulary size depends on both the dataset size and the languages involved. The most common vocabulary size in machine translation competitions is 32k (cf. a blog post). The rule of thumb is: the smaller the dataset, the smaller the subword vocabulary you should use. For 150k sentence pairs, 8k might be a good choice. You can also get an idea of how the vocabulary size influences translation quality from Table 3 of this paper.

    It is not always the case that a bigger vocabulary gives higher quality. Rare tokens in the vocabulary are updated only rarely, so their embeddings can get out of sync with the rest of the network. Therefore, smaller vocabulary sizes might work better for smaller datasets (see the sketch below).
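
    As a minimal sketch, here is how one might train a shared 8k-subword vocabulary with SentencePiece on the concatenated source and target data. The file names `train.src` and `train.tgt` and the model prefix `spm_shared` are placeholders, not anything prescribed by the question; adjust them and `vocab_size` to your own setup.

    ```python
    # Minimal sketch: train a shared subword tokenizer with SentencePiece.
    # Assumptions: parallel data lives in plain-text files "train.src" and
    # "train.tgt" (one sentence per line); paths and vocab_size are placeholders.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="train.src,train.tgt",   # train on both languages for a shared vocabulary
        model_prefix="spm_shared",     # writes spm_shared.model and spm_shared.vocab
        vocab_size=8000,               # rule of thumb: smaller dataset -> smaller vocabulary
        model_type="unigram",          # "bpe" is a common alternative
        character_coverage=1.0,        # 1.0 is typical for languages with small alphabets
    )

    # Load the trained model and segment a sentence into subword pieces.
    sp = spm.SentencePieceProcessor(model_file="spm_shared.model")
    print(sp.encode("This is a test sentence.", out_type=str))
    ```

    Training on the concatenation of both sides gives a single shared vocabulary, which is a common choice when the source and target languages share a script; separate vocabularies per language are also possible.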