machine-learning, nlp, machine-translation, seq2seq, vocabulary

Is there a limit to the size of target word vocabulary that should be used in seq2seq models?


In a machine translation seq2seq model (using an RNN/GRU/LSTM), we provide a sentence in a source language and train the model to map it to a sequence of words in another language (e.g., English to German).

The idea is that at each decoding step the decoder generates a classification vector (with the size of the target word vocabulary), a softmax is applied to this vector, and an argmax then gives the index of the most probable word.
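As an illustration, here is a minimal sketch of that output step in PyTorch; the framework, hidden size, and vocabulary size are my own assumptions, not part of the question:

```python
import torch
import torch.nn as nn

hidden_size = 512      # assumed decoder hidden-state size
vocab_size = 30_000    # assumed target vocabulary size

# Output projection: one logit per word in the target vocabulary.
output_projection = nn.Linear(hidden_size, vocab_size)

decoder_state = torch.randn(1, hidden_size)    # decoder output at one time step
logits = output_projection(decoder_state)      # shape: (1, vocab_size)
probs = torch.softmax(logits, dim=-1)          # distribution over the vocabulary
next_word_id = torch.argmax(probs, dim=-1)     # index of the most probable word
print(next_word_id.item())
```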

My question is: is there an upper limit to how large the target word vocabulary should be, considering:

  1. The performance remains reasonable (softmax will take more time for larger vectors; see the timing sketch after this list)
  2. The accuracy/correctness of prediction is acceptable
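To make consideration 1 concrete, here is a quick timing sketch (PyTorch on CPU, with sizes chosen purely for illustration) of how the cost of the output projection plus softmax grows with the vocabulary:

```python
import time
import torch

decoder_states = torch.randn(64, 512)            # assumed batch of decoder states
for vocab_size in (10_000, 50_000, 200_000):
    proj = torch.nn.Linear(512, vocab_size)
    start = time.perf_counter()
    for _ in range(100):                         # 100 decoding steps
        torch.softmax(proj(decoder_states), dim=-1)
    print(f"vocab={vocab_size:>7,}: {time.perf_counter() - start:.2f}s")
```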

Solution

  • The main technical limitation on the vocabulary size is GPU memory. The word embeddings and the output projection are the biggest parameter matrices in the model. With too large a vocabulary, you would be forced to use small training batches, which would significantly slow down the training.
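    For a back-of-the-envelope estimate of how those two matrices grow with the vocabulary, here is a small sketch; the embedding dimension and the listed vocabulary sizes are assumptions for illustration only.

    ```python
    def vocab_param_count(vocab_size, emb_dim=512, tied=False):
        """Rough parameter count of the target embedding table and output projection."""
        embeddings = vocab_size * emb_dim                    # target word embeddings
        projection = 0 if tied else vocab_size * emb_dim     # output (softmax) layer weights
        return embeddings + projection

    for v in (30_000, 50_000, 200_000, 1_000_000):
        params = vocab_param_count(v)
        # 4 bytes per float32 weight; gradients and optimizer state multiply this further.
        print(f"vocab={v:>9,}  params={params:>13,}  ~{params * 4 / 2**20:,.0f} MiB of fp32 weights")
    ```

    Tying the embedding and output-projection weights (`tied=True` above) halves this cost and is a common mitigation.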

    Also, it is not necessarily the case that the bigger the vocabulary, the better the performance. Words in a natural language are distributed according to Zipf's law, which means that a word's frequency is roughly inversely proportional to its frequency rank. As the vocabulary grows, you add words that are less and less common in the language. A word embedding gets updated only when that word occurs in the training data, so with a very large vocabulary the embeddings of the less frequent words end up undertrained and the model cannot handle them properly anyway.
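    One way to see this effect in your own data is to measure how much of the running text the most frequent word types already cover; everything beyond that gets very few training updates per word. A rough sketch (the file name `train.en` is a placeholder, and whitespace tokenization is a simplification):

    ```python
    from collections import Counter

    def coverage_by_vocab_size(corpus_path, vocab_sizes):
        """Print the fraction of running tokens covered by the top-k most frequent words."""
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())          # naive whitespace tokenization
        total = sum(counts.values())
        ranked = [freq for _, freq in counts.most_common()]
        for k in vocab_sizes:
            covered = sum(ranked[:k])
            print(f"top {k:>7,} word types cover {covered / total:.1%} of running tokens")

    coverage_by_vocab_size("train.en", [10_000, 30_000, 100_000])  # placeholder corpus file
    ```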

    MT models typically use a vocabulary of 30k-50k tokens. These are, however, not words but so-called subwords. The text gets segmented using a statistical heuristic such as byte-pair encoding (BPE), such that most of the common words remain as they are and less frequent words get split into subword units, ultimately down to single characters.
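    The core of BPE is simple: start from characters and repeatedly merge the most frequent adjacent symbol pair until the desired vocabulary budget is reached. Here is a toy sketch of the merge loop; the tiny hand-made frequency table is just for illustration, and real toolkits (e.g. subword-nmt or SentencePiece) do this at corpus scale.

    ```python
    import re
    from collections import Counter

    def pair_stats(vocab):
        """Count adjacent symbol pairs over a {word-as-space-separated-symbols: frequency} dict."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the pair with its concatenation."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Words as space-separated characters with an end-of-word marker, plus corpus frequencies.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    for _ in range(10):                # number of merges ~ size of the subword vocabulary
        pairs = pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print("merged:", "".join(best))
    ```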