I want to tie the weights of the embedding layer and the next-word prediction layer of the decoder. The embedding dimension is 300 and the decoder hidden size is 600. The target-language vocabulary size in my NMT model is 50000, so the embedding weight matrix is 50000 x 300, while the weight of the linear layer that predicts the next word is 50000 x 600.
So, how can I tie them? What will be the best approach to achieve weight tying in this scenario?
You could use a linear layer to project the 600-dimensional decoder state down to 300 before applying the shared projection. This way you still get the advantage of weight tying, namely that the entire embedding (possibly) receives a non-zero gradient on every mini-batch, at the cost of slightly increasing the capacity of the network through the extra projection parameters.
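A minimal PyTorch sketch of this idea, assuming a standard nn.Embedding / nn.Linear setup; the class and parameter names (TiedDecoderOutput, down_proj, out) are illustrative, not part of any particular library:

```python
import torch
import torch.nn as nn

class TiedDecoderOutput(nn.Module):
    """Project decoder hidden states (600-d) down to the embedding
    dimension (300-d), then reuse the embedding matrix (50000 x 300)
    as the output projection for next-word prediction."""

    def __init__(self, vocab_size=50000, emb_dim=300, hidden_dim=600):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)      # weight: 50000 x 300
        self.down_proj = nn.Linear(hidden_dim, emb_dim)         # 600 -> 300
        self.out = nn.Linear(emb_dim, vocab_size, bias=False)   # weight: 50000 x 300
        self.out.weight = self.embedding.weight                  # tie the two matrices

    def forward(self, hidden):
        # hidden: (batch, 600) decoder state -> logits: (batch, 50000)
        return self.out(self.down_proj(hidden))
```

Because nn.Linear(emb_dim, vocab_size) stores its weight as a 50000 x 300 matrix, it has the same shape as the embedding weight, so the assignment ties the two parameters directly and both layers are updated through the same tensor during backpropagation.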