pytorch, nlp, huggingface-transformers, bert-language-model, word-embedding

Why is nn.Embedding used for positional encoding in BERT?


In the Hugging Face implementation of the BERT model, nn.Embedding is used for the positional embedding. Why is it used instead of the traditional sin/cos positional encoding described in the Transformer paper? How are these two things the same?

I am also confused about the nn.Embedding layer itself. There are many word embeddings, such as word2vec and GloVe; which of them is nn.Embedding actually? Can you please explain the inner structure of nn.Embedding in detail? This question also comes to my mind.


Solution

  • nn.Embedding is just a table of vectors. Its inputs are indices into the table, and its outputs are the vectors associated with those indices. Conceptually, it is equivalent to multiplying one-hot vectors by a matrix, because the result is just the row of the matrix selected by the one-hot input.
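
    A minimal PyTorch sketch of that equivalence (the sizes here are arbitrary, just for illustration):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        # A table of 10 vectors, each of dimension 4.
        emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

        # Input: indices into the table; output: the corresponding rows.
        idx = torch.tensor([2, 5, 2])
        looked_up = emb(idx)                              # shape (3, 4)

        # Same result as one-hot vectors multiplied by the weight matrix.
        one_hot = F.one_hot(idx, num_classes=10).float()  # shape (3, 10)
        via_matmul = one_hot @ emb.weight                 # shape (3, 4)

        print(torch.allclose(looked_up, via_matmul))      # True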

    BERT is based on the Transformer architecture. The Transformer needs positional information to be added to the token embeddings so that it can distinguish where each token is in the sequence. In the original formulation of the Transformer, this positional information can be incorporated in two different ways (both with nearly equal performance): as fixed sinusoidal (sin/cos) encodings computed from the position and not trained, or as positional embeddings trained together with the rest of the model.

    The authors of the BERT paper decided to go with trained positional embeddings. Either way, the positional encodings are implemented with a normal embedding layer, where each vector of the table is associated with a different position in the input sequence.
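
    As a rough sketch (toy sizes and illustrative names only, not the actual Hugging Face code), a learned positional embedding is just a second nn.Embedding indexed by position, whose output is added to the token embedding:

        import torch
        import torch.nn as nn

        vocab_size, max_len, hidden = 100, 16, 8      # toy sizes, not BERT's real ones

        token_emb = nn.Embedding(vocab_size, hidden)  # one vector per vocabulary id
        pos_emb = nn.Embedding(max_len, hidden)       # one vector per position 0..max_len-1

        input_ids = torch.tensor([[7, 42, 3, 99]])                # (batch=1, seq_len=4)
        positions = torch.arange(input_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

        # The positional vector is selected purely by the token's position and
        # added to the token vector; both tables are trained with the model.
        hidden_states = token_emb(input_ids) + pos_emb(positions) # (1, 4, hidden)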

    Update:

    Positional embeddings are not essentially different from word embeddings. The only difference is how they are trained.

    With word embeddings, the vectors are trained so that they are useful for predicting the words that appear near the embedded word in the training data.

    With positional embeddings, each vector of the table is associated with an index representing a token position. The embeddings are trained so that the vector for a given position, when added to the token embedding at that position, is helpful for the task the model is trained on (masked language modelling for BERT, machine translation for the original Transformer).

    Therefore, the positional embeddings end up carrying information that depends on the position: each vector is selected purely by the position of the token it is added to, and it has been trained to be useful for the task at that position.

    The authors of the Transformer paper also found that they could simply use a "static" (not trained) version of the embeddings, i.e. the sinusoidal encodings, which reduces the number of parameters that have to be stored with the model. In that case, the information in the precomputed positional vectors, together with the learned token embeddings, is enough for the model to reach the same level of performance (at least on the machine translation task).
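
    For reference, a small sketch of those fixed sinusoidal encodings, following the sin/cos formulas from the Transformer paper (toy sizes again); the table is precomputed once and never trained:

        import math
        import torch

        def sinusoidal_encoding(max_len, d_model):
            # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
            # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
            pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
            div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                            * (-math.log(10000.0) / d_model))             # (d_model/2,)
            pe = torch.zeros(max_len, d_model)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return pe  # fixed table of shape (max_len, d_model), no parameters to train

        pe = sinusoidal_encoding(max_len=16, d_model=8)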