Tags: tensorflow, nlp, embedding, elmo

Confused about the 'tokens_length' parameter of the ELMo model in TensorFlow Hub


I'm looking at the ELMo model in TensorFlow Hub, and I'm not clear about what tokens_length = [6, 5] means in the example usage on the module page (https://tfhub.dev/google/elmo/2):

import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]
tokens_length = [6, 5]
embeddings = elmo(
    inputs={
        "tokens": tokens_input,
        "sequence_len": tokens_length
    },
    signature="tokens",
    as_dict=True)["elmo"]

It doesn't look like the max length of the input token sentences, and it doesn't look like [max number of words per sentence, number of sentences] either, which confuses me. Could someone explain this? Thanks!


Solution

  • The first example has length 6 and the second example has length 5, i.e.

    "the cat is on the mat" is 6 words long, but "dogs are in the fog" is only 5 words long. The extra empty string in the input is just padding so both rows have the same length, and it does add a little confusion :-/ (see the sketch after the quoted docs below).

    If you read the docs on that page, they explain why this is needed (emphasis mine):

    With the tokens signature, the module takes tokenized sentences as input. The input tensor is a string tensor with shape [batch_size, max_length] and an int32 tensor with shape [batch_size] corresponding to the sentence length. *The length input is necessary to exclude padding in the case of sentences with varying length.*
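
    To make the padding explicit, here is a minimal sketch (not from the module page) that builds tokens_input and tokens_length from raw tokenized sentences, runs the module, and prints the output shape. It assumes the TF1-style hub.Module API shown in the question; the "elmo" output of this module is documented as having shape [batch_size, max_length, 1024].

    import tensorflow as tf
    import tensorflow_hub as hub

    # Two tokenized sentences of different lengths.
    sentences = [["the", "cat", "is", "on", "the", "mat"],
                 ["dogs", "are", "in", "the", "fog"]]

    # Pad every sentence with "" up to the batch's max length...
    max_length = max(len(s) for s in sentences)
    tokens_input = [s + [""] * (max_length - len(s)) for s in sentences]
    # ...and record each sentence's true length so padding is excluded.
    tokens_length = [len(s) for s in sentences]  # -> [6, 5]

    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
    embeddings = elmo(
        inputs={"tokens": tokens_input, "sequence_len": tokens_length},
        signature="tokens",
        as_dict=True)["elmo"]

    with tf.Session() as sess:
        # ELMo needs both variable and table initialization in TF1.
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        print(sess.run(embeddings).shape)  # (2, 6, 1024) = [batch_size, max_length, embedding_dim]

    So tokens_length is simply the per-sentence word count, one entry per row of the batch, which tells the model to ignore the "" padding tokens.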