In a standard ANN, fully connected layers use the following formula: tf.matmul(X, weight) + bias. This is clear to me, as we use matrix multiplication to connect the input with the hidden layer.
But in the GloVe implementation (https://nlp.stanford.edu/projects/glove/) we use the following formula for the embedding multiplication: tf.matmul(W, tf.transpose(U)). What confuses me is the tf.transpose(U) part. Why do we use tf.matmul(W, tf.transpose(U)) instead of tf.matmul(W, U)?
It has to do with the choice of column vs row orientation for the vectors.
Note that weight is the second parameter here: tf.matmul(X, weight). But W is the first parameter here: tf.matmul(W, tf.transpose(U)).
So what you are seeing is a practical application of the following matrix transpose identity: (AB)^T = B^T A^T, or equivalently, XW = (W^T X^T)^T.
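Here is a minimal numeric check of that identity in TensorFlow (the shapes are arbitrary):

```python
import tensorflow as tf

# Check (A B)^T == B^T A^T on random matrices.
A = tf.random.normal([3, 5])
B = tf.random.normal([5, 4])

lhs = tf.transpose(tf.matmul(A, B))                # shape (4, 3)
rhs = tf.matmul(tf.transpose(B), tf.transpose(A))  # shape (4, 3)

# Equal up to floating-point tolerance.
print(bool(tf.reduce_all(tf.abs(lhs - rhs) < 1e-5)))  # True
```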
To bring it back to your example, let's assume 10 inputs and 20 outputs. The first approach uses row vectors. A single input X would be a 1x10 matrix, called a row vector because it has a single row. To match, the weight matrix needs to be 10x20 to produce an output of size 20.
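As a sketch, assuming those shapes (X, weight, and bias here are placeholder names, not anyone's actual variables):

```python
import tensorflow as tf

# Row-vector convention: one input per row.
X = tf.random.normal([1, 10])        # a single 1x10 row vector
weight = tf.random.normal([10, 20])  # maps 10 inputs to 20 outputs
bias = tf.zeros([20])

out = tf.matmul(X, weight) + bias
print(out.shape)  # (1, 20)
```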
But in the second approach the multiplication is reversed, which is a hint that everything is using column vectors, so named because they have a single column. By the identity above, when the multiplication is reversed, everything gets a transpose.
That's why the transpose is there. The way the GloVe authors have written their notation, with the multiplication reversed, the weight matrix W must already be transposed to 20x10 instead of 10x20, and they must be expecting a 20x1 column vector for the output. So if the input vector U is naturally a 1x10 row vector, it also has to be transposed, to a 10x1 column vector, to fit in with everything else.
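Here is the same sketch in the column-vector convention; W and U are placeholders with the shapes from this example, not the actual GloVe embedding matrices:

```python
import tensorflow as tf

# Column-vector convention: the multiplication is reversed.
U = tf.random.normal([1, 10])   # input stored as a 1x10 row vector
W = tf.random.normal([20, 10])  # weights already transposed: 20x10

out = tf.matmul(W, tf.transpose(U))  # (20, 10) @ (10, 1)
print(out.shape)  # (20, 1), a column vector
```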
Basically, you should pick either row vectors or column vectors and use them consistently; the order of the multiplications and the transposition of the weights is then determined for you.
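For instance, here is a quick check (with made-up shapes) that the two conventions compute the same numbers, just laid out as a row versus a column:

```python
import tensorflow as tf

X = tf.random.normal([1, 10])
weight = tf.random.normal([10, 20])

row_out = tf.matmul(X, weight)                              # 1x20 row vector
col_out = tf.matmul(tf.transpose(weight), tf.transpose(X))  # 20x1 column vector

# Same values, different orientation.
print(bool(tf.reduce_all(tf.abs(row_out - tf.transpose(col_out)) < 1e-5)))  # True
```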
Personally, I think that column vectors, as used by GloVe, are awkward and unnatural compared to row vectors; it's better to have the order of the multiplications follow the order of the data flow.