When doing machine translation, if you segment words into subword units, for example with BPE, how big is the resulting vocabulary?
The BPE algorithm starts with the list of characters in the data and iteratively merges the most frequent symbol pairs. If the algorithm had no stopping criterion, you would end up with a vocabulary covering every word in the training data, plus all individual characters, plus all the intermediate merges between the characters and the full words.
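Here is a minimal sketch of that merge loop in Python, in the style of Sennrich et al. (2016); the toy corpus, the `</w>` end-of-word marker, and the merge count are illustrative choices, not part of the algorithm itself:

```python
from collections import Counter
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Rewrite the vocabulary, replacing every occurrence of `pair` with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy word-frequency dictionary; words are stored as space-separated characters
# with an end-of-word marker.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

num_merges = 10  # real MT systems typically use tens of thousands
merges = []
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break  # nothing left to merge
    best = pairs.most_common(1)[0][0]
    vocab = apply_merge(best, vocab)
    merges.append(best)

print(merges)  # e.g. starts with ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```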
The reason for using BPE is that we simply cannot afford a vocabulary that contains all words from the training data: that can easily be millions of word forms. When using BPE, you therefore specify in advance how many merge operations you want. Typically, the number of merges is 20–50k. This ensures that the most frequent words remain intact, whereas less frequent words get split into smaller units. The resulting vocabulary size is then roughly the number of merges plus the size of the original character alphabet.
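In practice you rarely implement the loop yourself; libraries take the merge budget, or equivalently a target vocabulary size, as a parameter. As a hedged example, assuming you use the Hugging Face `tokenizers` library (the training file name and the 32k target are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# The trainer keeps merging until the vocabulary (character alphabet + merged
# symbols + special tokens) reaches `vocab_size`, which plays the role of the
# merge budget described above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])

tokenizer.train(files=["train.tok.en"], trainer=trainer)  # placeholder file name
print(tokenizer.get_vocab_size())  # at most 32000
```

With tools such as subword-nmt you instead pass the number of merge operations directly, and the resulting vocabulary comes out as roughly that number plus the character alphabet, as described above.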