While transfer learning / fine-tuning of recent language models such as BERT and XLNet is by now a very common practice, how does this apply to GloVe?
Basically, I see two options when using GloVe to get dense vector representations that can be used by downstream NNs.
1) Fine-tune the GloVe embeddings (in PyTorch terms, gradient enabled)
2) Just use the embeddings without gradient.
For instance, given GloVe's embedding matrix, I do

import torch
import torch.nn as nn

embed = nn.Embedding.from_pretrained(torch.tensor(embedding_matrix, dtype=torch.float))  # GloVe-initialized lookup
...
dense = nn.Linear(...)  # downstream layer trained on top
Is it best practice to use GloVe solely to obtain vector representations (and only train the dense layer and potentially other layers), or would one fine-tune the embedding matrix, too?
You should absolutely fine-tune your word embedding matrix. Here is the thing: when you initialize the word embedding matrix with the GloVe vectors, your word embeddings will already capture most of the semantic properties of the data. However, you want your word embeddings to be tailored to the task you're solving, i.e., task-specific (check Yang). Now, if you don't have enough data in your dataset, you can't learn a good word embedding matrix on your own (i.e., starting from a randomly initialized word embedding matrix). That is exactly why you want to initialize it with vectors that have been trained on huge datasets and are general.
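For concreteness, here is a minimal sketch of building such a GloVe-initialized, trainable embedding layer. The names vocab (a word-to-index dict), glove_path, and the value of embedding_dim are assumptions for illustration, not something from the original question:

import numpy as np
import torch
import torch.nn as nn

embedding_dim = 300                                   # must match the GloVe file you use
embedding_matrix = np.random.normal(scale=0.1, size=(len(vocab), embedding_dim))

# overwrite the rows of in-vocabulary words with their pretrained GloVe vectors
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        if word in vocab:
            embedding_matrix[vocab[word]] = np.asarray(values, dtype=np.float32)

embed = nn.Embedding.from_pretrained(
    torch.tensor(embedding_matrix, dtype=torch.float),
    freeze=False,  # gradient enabled, so the vectors can adapt to the task
)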
One really important thing to keep in mind: because the rest of your model is going to be initialized randomly, your word embedding matrix may suffer from catastrophic forgetting when you start training (check the work of Howard and Ruder and of Kirkpatrick et al.). That is, the gradients will be huge because your model will drastically underfit the data for the first few batches, and you will lose the initial vectors completely. You can overcome this as follows:
1) For the first several epochs, don't fine-tune the word embedding matrix; just keep it as it is: embeddings = nn.Embedding.from_pretrained(glove_vectors, freeze=True)

2) After the rest of the model has learned to fit your training data, decrease the learning rate, unfreeze your embedding module (embeddings.weight.requires_grad = True), and continue training, as in the sketch below.
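A compact sketch of this freeze-then-unfreeze schedule. The embedding layer is assumed to live inside the model, and build_model, train_one_epoch, the epoch counts, and the learning rates are placeholders you would replace with your own training loop and hyperparameters:

embeddings = nn.Embedding.from_pretrained(glove_vectors, freeze=True)  # frozen for now
model = build_model(embeddings)               # hypothetical: your downstream network

# phase 1: train everything except the embedding matrix
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
for epoch in range(3):
    train_one_epoch(model, optimizer)

# phase 2: unfreeze the embeddings and continue with a smaller learning rate
embeddings.weight.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(7):
    train_one_epoch(model, optimizer)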
By following the steps above, you will get the best of both worlds: your word embeddings will still capture general semantic properties while being tailored to your own downstream task. Finally, there are works (check Ye Zhang for example) showing that it is fine to fine-tune immediately, but I would opt for the safer option.
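If you do decide to fine-tune from the very start, one common compromise (my own suggestion, not something claimed in the cited work) is to give the embedding weights their own, smaller learning rate via optimizer parameter groups. Here embeddings is assumed to have been created with freeze=False and dense is the downstream layer from the question:

optimizer = torch.optim.Adam([
    {"params": embeddings.parameters(), "lr": 1e-4},  # gentle updates for the GloVe vectors
    {"params": dense.parameters(), "lr": 1e-3},       # normal rate for the rest of the model
])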