pytorch, recurrent-neural-network, word-embedding, attention-model, sequence-to-sequence

Embedding layer in neural machine translation with attention


I am trying to understand how to implement a seq-to-seq model with attention from this website.

My question: Does nn.Embedding just return some ID for each word, so that the embedding of each word stays the same during the whole training? Or do the embeddings change during training?

My second question: I am confused about whether, after training, the output of nn.Embedding is something like word2vec word embeddings or not.

Thanks in advance


Solution

  • According to the PyTorch docs:

    A simple lookup table that stores embeddings of a fixed dictionary and size.

    This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.

    In short, nn.Embedding embeds a sequence of vocabulary indices into a new embedding space. You can indeed roughly understand this as a word2vec style mechanism.

    As a dummy example, let's create an embedding layer that takes a vocabulary of 10 tokens as input (i.e. the input data contains only 10 unique tokens) and returns embedded word vectors living in a 5-dimensional space. In other words, each word is represented as a 5-dimensional vector. The dummy data is a sequence of 3 words with indices 1, 2, and 3, in that order.

    >>> import torch
    >>> import torch.nn as nn
    >>> embedding = nn.Embedding(10, 5)
    >>> embedding(torch.tensor([1, 2, 3]))
    tensor([[-0.7077, -1.0708, -0.9729,  0.5726,  1.0309],
            [ 0.2056, -1.3278,  0.6368, -1.9261,  1.0972],
            [ 0.8409, -0.5524, -0.1357,  0.6838,  3.0991]],
           grad_fn=<EmbeddingBackward>)
    

    You can see that each of the three words is now represented as a 5-dimensional vector. We also see that the output carries a grad_fn attribute, which means that the weights of this layer will be adjusted through backprop. This answers your question of whether embedding layers are trainable: the answer is yes. And indeed this is the whole point of embedding: we expect the embedding layer to learn meaningful representations, the famous king - man + woman ≈ queen being the classic example of what these embedding layers can learn.
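    If you want to convince yourself that the weights really do change, here is a minimal sketch using a fresh layer emb with an arbitrary dummy loss and SGD optimizer (both chosen purely for illustration), showing the weight matrix being updated by a single gradient step:

    >>> emb = nn.Embedding(10, 5)
    >>> optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)
    >>> before = emb.weight.detach().clone()        # snapshot of the weights before the step
    >>> loss = emb(torch.tensor([1, 2, 3])).sum()   # dummy loss, just to produce gradients
    >>> loss.backward()
    >>> optimizer.step()
    >>> torch.equal(before, emb.weight)             # the looked-up rows have been updated
    False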


    Edit

    The embedding layer is, as the documentation states, a simple lookup table from a matrix. You can see this by doing

    >>> embedding.weight
    Parameter containing:
    tensor([[-1.1728, -0.1023,  0.2489, -1.6098,  1.0426],
            [-0.7077, -1.0708, -0.9729,  0.5726,  1.0309],
            [ 0.2056, -1.3278,  0.6368, -1.9261,  1.0972],
            [ 0.8409, -0.5524, -0.1357,  0.6838,  3.0991],
            [-0.4569, -1.9014, -0.0758, -0.6069, -1.2985],
            [ 0.4545,  0.3246, -0.7277,  0.7236, -0.8096],
            [ 1.2569,  1.2437, -1.0229, -0.2101, -0.2963],
            [-0.3394, -0.8099,  1.4016, -0.8018,  0.0156],
            [ 0.3253, -0.1863,  0.5746, -0.0672,  0.7865],
            [ 0.0176,  0.7090, -0.7630, -0.6564,  1.5690]], requires_grad=True)
    

    You will see that the second, third, and fourth rows of this matrix (indices 1, 2, and 3) correspond to the result that was returned in the example above. In other words, for a token whose index is n, the embedding layer will simply "look up" the n-th row of its weight matrix and return that row vector; hence the lookup table.
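    To verify this lookup behaviour, you can compare the module's output against direct row indexing of the weight matrix (continuing the example above):

    >>> torch.equal(embedding(torch.tensor([1, 2, 3])), embedding.weight[[1, 2, 3]])
    True

    And to connect this back to your word2vec question: if you already have pretrained vectors (e.g. from word2vec or GloVe), you can load them into the layer with nn.Embedding.from_pretrained, which freezes the weights by default; pass freeze=False if you want to fine-tune them during training. The pretrained matrix below is a random stand-in, purely for illustration:

    >>> pretrained = torch.randn(10, 5)  # stand-in for a real pretrained embedding matrix
    >>> frozen_emb = nn.Embedding.from_pretrained(pretrained)                 # not trainable
    >>> tunable_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tunable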