Can anyone help me understand how to handle compressing/expanding the dimension of a tensor using EinsumDense?
I have a time-series (not NLP) input tensor of shape (batch, horizon, features), where the intended output is (1, H, F); H is an arbitrary horizon and F is an arbitrary feature size. I'm using EinsumDense as my feed-forward network (FFN) in a transformer encoder module and as a final dense layer on the transformer's output. The FFN should map (1, horizon, features) to (1, H, features), and the final dense layer should map (1, H, features) to (1, H, F).
My current equations are shf,h->shf for the FFN and shf,hfyz->syz for the dense layer; however, I'm getting a less-than-optimal result compared to my original setup, where there was no change in the horizon length and my equations were shf,h->shf and shf,hz->shz respectively.
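For concreteness, the setup described above can be sketched with tf.keras.layers.EinsumDense; all sizes below are placeholders, not the real ones:

```python
import tensorflow as tf

# Placeholder sizes (the real horizon/feature sizes are arbitrary):
horizon, features = 8, 4   # input sizes
H, F = 6, 3                # desired output sizes

# FFN with the current equation; note that shf,h->shf cannot change the horizon.
ffn = tf.keras.layers.EinsumDense('shf,h->shf', output_shape=(horizon, features))
# Final dense layer: its kernel (horizon, features, H, F) mixes every (h, f) pair.
dense = tf.keras.layers.EinsumDense('shf,hfyz->syz', output_shape=(H, F))

x = tf.random.normal((1, horizon, features))
y = dense(ffn(x))   # shape (1, H, F)
```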
My two cents,
First, an intuitive understanding of the transformer encoder: given (batch, horizon, features), the attention mechanism tries to find a weighted linear combination of the projected features. The resulting weights are learned via attention scores obtained by operating between features, over each horizon. The FFN layer that comes next should be a linear combination of values within features.
Coming to EinsumDense, by way of example we have two tensors:

- a: data (your input tensor to EinsumDense)
- b: weights (EinsumDense's internal weights tensor)
# create random data in a 3D tensor
a = tf.random.uniform(minval=1, maxval=3, shape=(1,2,3), dtype=tf.int32)
# [[[1, 2, 2],
# [2, 2, 1]]]
shf,h->shf: This just scales each horizon step (all of its features) by a single weight.
b = tf.random.uniform(minval=2, maxval=4, shape=(2,), dtype=tf.int32)
# [3, 2]
tf.einsum('shf,h->shf', a, b)
# [[[3, 6, 6],  # 1st horizon step is scaled by 3
#   [4, 4, 2]]] # 2nd horizon step is scaled by 2
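To see that this is plain per-step scaling, the same einsum can be written as a broadcasted multiply (a small check, hard-coding the example values shown above):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])   # shape (1, 2, 3): (s, h, f)
b = tf.constant([3, 2])          # shape (2,): one scalar per horizon step

out = tf.einsum('shf,h->shf', a, b)
# identical to broadcasting b over the batch and feature axes
same = a * b[None, :, None]
# out == same == [[[3, 6, 6], [4, 4, 2]]]
```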
shf,hz->shz: This does a linear combination within features.
b = tf.random.uniform(minval=2, maxval=4, shape=(2,6), dtype=tf.int32)
# [[3, 3, 3, 3, 3, 3],
# [2, 2, 2, 3, 2, 3]]
tf.einsum('shf,hz->shz', a, b)
# [[[15, 15, 15, 15, 15, 15],
# [10, 10, 10, 15, 10, 15]]]
# every value in the first row combines the first horizon step's features [1, 2, 2] with b; the first value is sum([1, 2, 2]) * 3 = 15
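Note that f appears only in a, so it is summed out before the weights are applied; the einsum is equivalent to scaling each per-horizon weight row by that step's feature sum (a small check, hard-coding the example values shown above):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])             # (s, h, f) = (1, 2, 3)
b = tf.constant([[3, 3, 3, 3, 3, 3],
                 [2, 2, 2, 3, 2, 3]])      # (h, z) = (2, 6)

out = tf.einsum('shf,hz->shz', a, b)
# f only appears in a, so it is reduced first:
# out[s, h, z] == sum_f(a[s, h, f]) * b[h, z]
same = tf.reduce_sum(a, axis=-1, keepdims=True) * b[None, :, :]
```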
The above two resemble the transformer encoder architecture, with a feature-scaling layer, and the output structure is preserved: (batch, H, F).
shf,hfyz->syz: This does both between-features and within-features combination.
b = tf.random.uniform(minval=2, maxval=4, shape=(2,3,4,5), dtype=tf.int32)
tf.einsum('shf,hfyz->syz', a,b)
# each output element `(i,j)` is the dot product of a with b[:,:,i,j]
# first element is tf.reduce_sum(a*b[:,:,0,0])
In the output (s, y, z), y doesn't correspond to horizon and z doesn't correspond to features; every output value is a combination of values across both.
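As a check that this einsum really mixes every (h, f) position, it is equivalent to flattening (h, f) into a single axis and applying one matmul (shapes below match the example above):

```python
import tensorflow as tf

a = tf.random.uniform(minval=1, maxval=3, shape=(1, 2, 3), dtype=tf.int32)
b = tf.random.uniform(minval=2, maxval=4, shape=(2, 3, 4, 5), dtype=tf.int32)

out = tf.einsum('shf,hfyz->syz', a, b)
# equivalent: flatten (h, f) -> 6 and (y, z) -> 20, one matmul, reshape back
flat = tf.matmul(tf.reshape(a, (1, 6)), tf.reshape(b, (6, 20)))
same = tf.reshape(flat, (1, 4, 5))
```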