I'm trying to implement a Seq2Seq model with attention in CNTK, something very similar to CNTK Tutorial 204. However, several small differences lead to various issues and error messages, which I don't understand. There are many questions here, which are probably interconnected and all stem from some single thing I don't understand.
Note (in case it's important). My input data comes from MinibatchSourceFromData
, created from NumPy arrays that fit in RAM, I don't store it in a CTF.
ins = C.sequence.input_variable(input_dim, name="in", sequence_axis=inAxis)
y = C.sequence.input_variable(label_dim, name="y", sequence_axis=outAxis)
Thus, the shapes are [#, *](input_dim)
and [#, *](label_dim)
.
Question 1: When I run the CNTK 204 Tutorial and dump its graph into a .dot
file using cntk.logging.plot
, I see that its input shapes are [#](-2,)
. How is this possible?
*
) disappear?Question 2: In the same tutorial, we have attention_axis = -3
. I don't understand this. In my model there are 2 dynamic axis and 1 static, so "third to last" axis would be #
, the batch axis. But attention definitely shouldn't be computed over the batch axis.
I hoped that looking at the actual axes in the tutorial code would help me understand this, but the [#](-2,)
issue above made this even more confusing.
Setting attention_axis
to -2
gives the following error:
RuntimeError: Times: The left operand 'Placeholder('stab_result', [#, outAxis], [128])'
rank (1) must be >= #axes (2) being reduced over.
during creation of the training-time model:
def train_model(m):
@C.Function
def model(ins: InputSequence[Tensor[input_dim]],
labels: OutputSequence[Tensor[label_dim]]):
past_labels = Delay(initial_state=C.Constant(seq_start_encoding))(labels)
return m(ins, past_labels) #<<<<<<<<<<<<<< HERE
return model
where stab_result
is a Stabilizer
right before the final Dense
layer in the decoder. I can see in the dot-file that there are spurious trailing dimensions of size 1 that appear in the middle of the AttentionModel
implementation.
Setting attention_axis
to -1
gives the following error:
RuntimeError: Binary elementwise operation ElementTimes: Left operand 'Output('Block346442_Output_0', [#, outAxis], [64])'
shape '[64]' is not compatible with right operand
'Output('attention_weights', [#, outAxis], [200])' shape '[200]'.
where 64 is my attention_dim
and 200 is my attention_span
. As I understand, the elementwise *
inside the attention model definitely shouldn't be conflating these two together, therefore -1
is definitely not the right axis here.
Question 3: Is my understanding above correct? What should be the right axis and why is it causing one of the two exceptions above?
Thanks for the explanations!
First, some good news: A couple of things have been fixed in the AttentionModel in the latest master (will be generally available with CNTK 2.2 in a few days):
attention_span
or an attention_axis
. If you don't specify them and leave them at their default values, the attention is computed over the whole sequence. In fact these arguments have been deprecated.Regarding your questions:
The dimension is not negative. We use certain negative numbers in various places to mean certain things: -1 is a dimension that will be inferred once based on the first minibatch, -2 is I think the shape of a placeholder, and -3 is a dimension that will be inferred with each minibatch (such as when you feed variable sized images to convolutions). I think if you print the graph after the first minibatch, you should see all shapes are concrete.
attention_axis
is an implementation detail that should have been hidden. Basically attention_axis=-3
will create a shape of (1, 1, 200), attention_axis=-4
will create a shape of (1, 1, 1, 200) and so on. In general anything more than -3 is not guaranteed to work and anything less than -3 just adds more 1s without any clear benefit. The good news of course is that you can just ignore this argument in the latest master.
TL;DR: If you are in master (or starting with CNTK 2.2 in a few days) replace AttentionModel(attention_dim, attention_span=200, attention_axis=-3)
with
AttentionModel(attention_dim)
. It is faster and does not contain confusing arguments. Starting from CNTK 2.2 the original API is deprecated.