keras · neural-network · word-embedding · glove

Should the vocabulary be restricted to the training-set vocabulary when training an NN model with pretrained word embeddings such as GloVe?


I want to use pre-trained GloVe vectors for the Embedding layer in my neural network. Do I need to restrict the vocabulary to the training set when constructing the word2index dictionary? Wouldn't that lead to a limited, non-generalizable model? Is using the entire GloVe vocabulary a recommended practice?


Solution

  • Yes, it is better to restrict your vocabulary size. Pre-trained embeddings (GloVe, and likewise Word2Vec) contain many words that are not useful for your task, and the larger the vocabulary, the more RAM you need, among other costs.

    Select your tokens from all of your data. This won't lead to a limited, non-generalizable model if your data is big enough. If you think your data does not contain as many tokens as you need, then you should know two things:

    1. Your data is not good enough, and you have to gather more.
    2. Your model can't generalize well to tokens it hasn't seen during training, so there is no point in keeping many unused words in your embedding; it is better to gather more data that covers those words.

    I have an answer here showing how you can select a small subset of word vectors from a pre-trained model; a minimal sketch of the overall approach follows below.
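
    A minimal sketch of the idea, assuming a GloVe file named glove.6B.100d.txt, a list of training strings called train_texts, and an arbitrary vocabulary cap — all of these names and numbers are placeholders, not part of the original answer:

    ```python
    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.layers import Embedding
    from tensorflow.keras.initializers import Constant

    EMBEDDING_DIM = 100   # must match the GloVe file used
    MAX_VOCAB = 20000     # cap on the most frequent training tokens

    train_texts = ["a list of raw training sentences goes here"]  # placeholder

    # 1. Build the word index from the training data only.
    tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token="<unk>")
    tokenizer.fit_on_texts(train_texts)
    word_index = tokenizer.word_index  # word -> integer id (1-based)

    # 2. Load only the GloVe vectors for words that occur in the training data.
    glove = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            word, *vec = line.rstrip().split(" ")
            if word in word_index:
                glove[word] = np.asarray(vec, dtype="float32")

    # 3. Build the embedding matrix; rows for words missing from GloVe stay
    #    zero and can be learned later if the layer is made trainable.
    num_words = min(MAX_VOCAB, len(word_index) + 1)
    embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i < num_words and word in glove:
            embedding_matrix[i] = glove[word]

    # 4. Initialise the Embedding layer with the restricted matrix.
    embedding_layer = Embedding(
        input_dim=num_words,
        output_dim=EMBEDDING_DIM,
        embeddings_initializer=Constant(embedding_matrix),
        trainable=False,  # set True to fine-tune the vectors
    )
    ```

    Only vectors for words that actually appear in the training data are kept, so memory use scales with the task vocabulary rather than with the full GloVe vocabulary, and unseen words simply map to the OOV token at inference time.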