
Understanding ELMo's number of representations


I am trying my hand at ELMo by simply using it as part of a larger PyTorch model. A basic example is given here.

This is a torch.nn.Module subclass that computes any number of ELMo representations and introduces trainable scalar weights for each. For example, this code snippet computes two layers of representations (as in the SNLI and SQuAD models from our paper):

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# Compute two different representations for each token.
# Each representation is a linear weighted combination of the
# 3 layers in ELMo (i.e., the char CNN and the outputs of the two BiLSTMs)
elmo = Elmo(options_file, weight_file, 2, dropout=0)

# use batch_to_ids to convert sentences to character ids
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)

# embeddings['elmo_representations'] is a length-two list of tensors.
# Each element contains one layer of ELMo representations with shape
# (2, 3, 1024).
#   2    - the batch size
#   3    - the sequence length of the batch
#   1024 - the length of each ELMo vector

My question concerns the 'representations'. Can you compare them to normal word2vec output layers? You can choose how many ELMo will give back (increasing an n-th dimension), but what is the difference between these generated representations and what is their typical use?

To give you an idea, for the above code, embeddings['elmo_representations'] returns a list of two items (the two representation layers) but they are identical.

In short, how can one define the 'representations' in ELMo?


Solution

  • See Section 3.2 of the original paper.

    ELMo is a task specific combination of the intermediate layer representations in the biLM. For each token, a L-layer biLM computes a set of 2L + 1 representations

    Previously in Section 3.1, it is said that:

    Recent state-of-the-art neural language models compute a context-independent token representation (via token embeddings or a CNN over characters) then pass it through L layers of forward LSTMs. At each position k, each LSTM layer outputs a context-dependent representation. The top layer LSTM output is used to predict the next token with a Softmax layer.

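    To answer your question: each 'representation' is a weighted combination of the biLM layers, that is, the context-independent token layer plus these L LSTM-based context-dependent layers. Each requested representation gets its own set of trainable mixing weights.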
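
To make the "task specific combination" concrete, here is a minimal pure-Python sketch of the mixing step described in Section 3.2 of the paper: softmax-normalized scalar weights s_j over the L + 1 layers, times a scale factor gamma. The function names (`softmax`, `scalar_mix`) and the toy numbers are made up for illustration; this is not AllenNLP's implementation. It also suggests one plausible reason why the two representations in the question come out identical: if both sets of mixing weights start from the same initial values (an assumption about the library's initialization) and dropout is 0, the two mixes coincide until training moves the weights apart.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, scalars, gamma):
    """Combine per-layer vectors for one token.

    layers  : list of L + 1 vectors (token layer + L biLSTM layers)
    scalars : unnormalized mixing weights, one per layer
    gamma   : scalar scale factor
    Returns gamma * sum_j softmax(scalars)[j] * layers[j].
    """
    weights = softmax(scalars)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy layer outputs for a single token (hypothetical values, L = 2).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two "representations" with identical initial weights (assumed init).
mix_a = scalar_mix(layers, [0.0, 0.0, 0.0], 1.0)
mix_b = scalar_mix(layers, [0.0, 0.0, 0.0], 1.0)

# The same layers after hypothetical training changed one set of weights.
mix_c = scalar_mix(layers, [2.0, 0.0, -1.0], 0.5)

print(mix_a == mix_b)  # identical before any training
print(mix_a != mix_c)  # differ once the weights differ
```

As for typical use, the paper reports that for some tasks (SNLI and SQuAD among them) ELMo is added both at the input and at the output of the task model, and each location gets its own learned mix. That is why one would ask for two representations rather than one.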