I'm trying to build a basic transformer using the Keras Attention layer. For this I need three separate Dense layers, which generate the key, query, and value matrices respectively by running every word embedding through them. But there seems to be no such functionality in Keras. Here's what I've got so far:
from tensorflow.keras import layers

# 16 word embeddings with dimension 64
inputs = layers.Input(shape=(16, 64))
# one Dense layer per projection, run over every embedding
key = layers.Dense(64, activation="relu")(inputs)
query = layers.Dense(64, activation="relu")(inputs)
value = layers.Dense(64, activation="relu")(inputs)
# Attention expects its inputs as a list [query, value, key]
x = layers.Attention()([query, value, key])
result = layers.Dense(8, activation="sigmoid")(x)
The problem with this code is that if you feed a matrix into a Dense layer, it doesn't process it row by row; instead, it calculates the dot product between the rows and the kernel and then feeds that into the network:
Note: If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel (using tf.tensordot). For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).
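To see the shape behavior the docs describe, here is a minimal sanity check (the shapes and variable names are my own, not from the question):

import tensorflow as tf
from tensorflow.keras import layers

# a dummy batch of 2 sequences, each holding 16 embeddings of dimension 64
dummy = tf.random.normal((2, 16, 64))
out = layers.Dense(32)(dummy)
print(out.shape)  # (2, 16, 32): one 32-dim output per input row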
So how can I feed in a matrix and then have it processed row by row?
It turns out I had misunderstood the documentation: Dense does exactly what I want, since the kernel (the parameters, i.e. the connections between the neurons) is multiplied with each row of the input matrix, meaning a new row is calculated for each input row.
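For anyone else who misread the docs the same way, here is a quick check of my own (not from the docs) showing that applying Dense to the whole matrix is equivalent to applying it to each row separately:

import tensorflow as tf
from tensorflow.keras import layers

dense = layers.Dense(64)
matrix = tf.random.normal((1, 16, 64))  # a batch containing one 16x64 matrix

whole = dense(matrix)  # apply to the full matrix at once
# apply the same layer to each 64-dim row individually and restack
row_by_row = tf.stack([dense(matrix[:, i]) for i in range(16)], axis=1)
print(tf.reduce_max(tf.abs(whole - row_by_row)).numpy())  # ~0.0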