Some of the tutorials I came across described using a randomly initialized embedding matrix and then using the tf.nn.embedding_lookup function to obtain the embeddings for the integer sequences. I am under the impression that since the embedding_matrix is obtained through tf.get_variable, the optimizer would add appropriate ops for updating it.
What I don't understand is how backpropagation happens through the lookup function, which seems to be hard rather than soft. What is the gradient of this operation w.r.t. one of its input ids?
Embedding matrix lookup is mathematically equivalent to dot product with the one-hot encoded matrix (see this question), which is a smooth linear operation.
For example, here's a lookup at index 3:
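To make the equivalence concrete, here is a minimal NumPy sketch (not TensorFlow code). The sizes, the random matrix, and the variable names are assumptions chosen for illustration; only the index 3 comes from the text above.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size, embed_dim = 5, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))  # embedding matrix, one row per id

idx = 3

# Route 1: build the one-hot vector and take the dot product with W.
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0
via_matmul = one_hot @ W

# Route 2: read the row directly, which is what a lookup does.
via_lookup = W[idx]

print(np.allclose(via_matmul, via_lookup))
```

Both routes select the same row of W, so the lookup can be treated as a linear operation on W.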
Here's the formula for the gradient:
... where the left-hand side is the derivative of the negative log-likelihood (i.e., the objective function), x are the input words, W is the embedding matrix and delta is the error signal.
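One way to write this out, under conventions that are assumptions here rather than something the answer states: x is the one-hot column vector for a single input word (length V), W is the V-by-d embedding matrix, e is the looked-up embedding, and delta is the gradient of the objective with respect to e.

```latex
e = W^{\top} x,
\qquad
\delta = \frac{\partial\, \mathrm{NLL}}{\partial e},
\qquad
\frac{\partial\, \mathrm{NLL}}{\partial W} = x\, \delta^{\top}
```

Because x has a single nonzero entry, the outer product has a single nonzero row: only the embedding of the word that was looked up receives a gradient.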
tf.nn.embedding_lookup is optimized so that no one-hot encoding conversion happens, but backprop works according to the same formula.
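The same point can be checked numerically. The sketch below uses NumPy rather than TensorFlow and is only a conceptual illustration, not the library's implementation. The ids, the sizes, and the random stand-in for the error signal are assumptions. It compares the gradient computed through an explicit one-hot matrix with a scatter-add that never builds one.

```python
import numpy as np

# Hypothetical sizes and inputs, chosen only for illustration.
vocab_size, embed_dim = 5, 4
ids = np.array([3, 1, 3])  # input ids; id 3 appears twice on purpose

rng = np.random.default_rng(0)
# Stand-in for the error signal: gradient of the loss w.r.t. each looked-up row.
delta = rng.normal(size=(len(ids), embed_dim))

# Dense route: one-hot encode the ids, then apply the matrix formula X^T @ delta.
X = np.eye(vocab_size)[ids]   # shape (3, vocab_size)
grad_dense = X.T @ delta      # shape (vocab_size, embed_dim)

# Sparse route: add each row of delta into the row of the gradient named by its id.
grad_sparse = np.zeros((vocab_size, embed_dim))
np.add.at(grad_sparse, ids, delta)  # accumulates correctly for repeated ids

print(np.allclose(grad_dense, grad_sparse))
```

Rows of W whose ids never appear in the input get a zero gradient, and a repeated id accumulates the error signal from each occurrence. The gradient flows to the rows of W; the integer ids themselves are not differentiable inputs.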