I have studied the theory of the seq2seq model, but I couldn't clearly understand what exactly the context vector is and how it is generated. I know it summarizes the meaning of the to-be-encoded sequence, but how exactly?
In the attention mechanism it is c_i = Σ_j (α_ij · h_j) [according to Dzmitry Bahdanau, 2014].
But in the normal seq2seq model I couldn't find a formula for the context vector, neither in Ilya Sutskever 2014 nor on the internet; there is only the formula for the conditional probability p(y_1, y_2, ..., y_t | x_1, x_2, ..., x_t).
I am also confused about whether the classic seq2seq context vector of a sentence is the same as the average of its word2vec embeddings.
In short, I am expecting a clear explanation of how the context vector is created, what it represents, and how the decoder extracts information from it.
In a sequence-to-sequence (seq2seq) model, the context vector is a representation of the input sequence produced by the encoder and consumed by the decoder to generate the output sequence. As the encoder reads the input, it produces a sequence of hidden states, each of which captures information about the input up to that point. The context vector is formed by combining these hidden states in some way: with attention, it is a weighted sum of the hidden states, while in a basic seq2seq model without attention it is typically just the final hidden state of the encoder.

During decoding, the decoder uses the context vector to generate each element of the output sequence. The context vector is not the same as the average of the word embeddings in the input sequence; it is a learned representation specific to the model architecture and the task at hand. The formula you quoted, c_i = Σ_j (α_ij · h_j), is the standard formula for computing the context vector with the attention mechanism.
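Here is a minimal sketch of that weighted sum, assuming made-up shapes (5 source tokens, hidden size 8) and random scores just to show the mechanics; it is not a full attention implementation:

```python
import torch

h = torch.randn(5, 8)                      # encoder hidden states h_1 ... h_5 (assumed sizes)
scores = torch.randn(5)                    # alignment scores for one decoder step i (random here)
alpha = torch.softmax(scores, dim=0)       # attention weights α_ij, they sum to 1
c_i = (alpha.unsqueeze(1) * h).sum(dim=0)  # context vector c_i = Σ_j α_ij h_j
print(c_i.shape)                           # torch.Size([8]): one context vector per decoder step
```

In a real model the scores come from a learned alignment function of the decoder state and each encoder state, so the context vector changes at every decoder step.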
In a basic seq2seq model without attention, the context vector is typically the final hidden state produced by the encoder. This hidden state is then used as the initial hidden state for the decoder, which generates the output sequence one step at a time.
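A rough sketch of that setup, assuming PyTorch, a GRU encoder, and arbitrary vocabulary/embedding/hidden sizes (none of these names come from the papers):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 32, 64        # made-up sizes for illustration
embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, 7))             # one source sentence of 7 token ids
outputs, h_n = encoder(embedding(src))                  # outputs: every hidden state, h_n: the last one
context = h_n                                           # (1, 1, 64): this is the "context vector"
# the context vector becomes the decoder's initial hidden state:
# dec_out, _ = decoder(embedding(tgt), context)
```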
During decoding, the context vector is what the decoder starts from to generate the output sequence. At each time step, the decoder takes the previous output element and its current hidden state and produces a new hidden state and the next output element.
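A hedged sketch of that greedy decoding loop, again with made-up sizes, a hypothetical start-of-sentence token id of 0, and a random vector standing in for the real encoder context:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 32, 64    # same made-up sizes as above
embedding = nn.Embedding(vocab_size, emb_dim)
decoder_cell = nn.GRUCell(emb_dim, hidden_dim)
output_layer = nn.Linear(hidden_dim, vocab_size)

hidden = torch.randn(1, hidden_dim)   # stands in for the encoder's context vector
token = torch.tensor([0])             # assumed <sos> token id
generated = []
for _ in range(10):                   # generate at most 10 output tokens
    hidden = decoder_cell(embedding(token), hidden)  # new hidden state from previous token + state
    logits = output_layer(hidden)                    # scores over the target vocabulary
    token = logits.argmax(dim=-1)                    # greedy choice of the next word
    generated.append(token.item())
```

During training the decoder is usually fed the ground-truth previous token instead of its own prediction (teacher forcing), but the state update is the same.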
As an example, let's say we have an Arabic sentence and we want to translate it to English. We accomplish this task by training with the Arabic sentence as the input sequence and the English sentence as the output sequence. The model consists of two main components: an encoder and a decoder. The encoder takes in the Arabic sentence as input and produces a fixed-length context vector that summarizes the input sequence. The decoder then takes in the context vector and generates the corresponding English translation one word at a time.
Andrew Ng's videos on YouTube provide an excellent explanation; I learned this from them myself: https://www.youtube.com/watch?v=IV8--Y3evjw&list=PLiWO7LJsDCHcpUmL9grX9WLjyi-e92iCO