I'm fairly new to NLP and I was reading a blog explaining the transformer model. I was quite confused about the input/output for the decoder block (attached below). I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block. What I don't get is: if we already know y_true, why run this step to get the output probability? I just don't quite get the relationship between the bottom right "Output Embedding" and the top right "Output Probabilities". When we actually use the model we won't have y_true, so do we just feed y_pred into the decoder instead? This might be a noob question. Thanks in advance.
I get that y_true is fed into the decoder during the training step to combine with the output of the encoder block.
Well, yes and no.
The job of the decoder block is to predict the next word. The inputs to the decoder are the output of the encoder and the previous outputs of the decoder block itself.
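Put another way, the decoder models each target word conditioned on the source sentence and on the target words produced so far. This is the standard autoregressive factorization (writing x for the source sentence and y for the target sequence):

$$p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x)$$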
Let's take a translation example: English to Spanish.
The encoder will encode the English sentence and produce an attention vector as output. At the first step the decoder is fed the attention vector and a <START> token. The decoder will (should) produce the first Spanish word, Nosotras; this is Yt. At the next step the decoder is again fed the attention vector, along with the <START> token and the previous output Yt-1 (Nosotras). The next output will be tenemos, and so on and so forth, until the decoder spits out an <END> token.
The decoder is thus an autoregressive model: it relies on its own previous outputs to generate the next token in the sequence.
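To make that concrete, here is a minimal sketch of the inference-time loop, assuming greedy decoding. The `encoder` and `decoder` callables and the START/END token ids are hypothetical stand-ins, not any particular library's API:

```python
# Minimal sketch of greedy autoregressive decoding at inference time.
# `encoder`, `decoder`, and the token ids are assumed placeholders.

START, END = 1, 2          # assumed ids for the <START> and <END> tokens
MAX_LEN = 50               # safety cap on the generated length


def translate(src_tokens, encoder, decoder):
    """Generate a target sequence one token at a time."""
    memory = encoder(src_tokens)        # encoder output (attention/context vectors)

    generated = [START]                 # step 0: the decoder only sees <START>
    for _ in range(MAX_LEN):
        # The decoder is fed the encoder output plus everything it has
        # produced so far (its own previous outputs).
        logits = decoder(generated, memory)

        next_token = int(logits[-1].argmax())   # greedily pick the next word
        generated.append(next_token)

        if next_token == END:           # stop once the model emits <END>
            break

    return generated[1:]                # drop the <START> token
```

At training time you already know the whole target sentence, so this loop isn't needed; at inference time, as in the sketch above, each predicted word is appended and fed back in as the next step's decoder input.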