python word-embedding glove

How to calculate the mean of the words' GloVe embeddings in a sentence


I have downloaded the pretrained GloVe embedding matrix and used it in a Keras layer. However, I need sentence embeddings for another task.

I want to calculate the mean of the embeddings of all the words in each sentence.

What is the most efficient way to do that, given that there are about 25,000 sentences?

Also, I don't want to use a Lambda layer in Keras to compute the mean.


Solution

  • A simple way to do this is to use a GlobalAveragePooling1D layer. It receives the token embeddings from the Embedding layer, with shape (n_sentence, n_token, emb_dim), and averages them over the token axis, so each sentence is reduced to the mean of its token vectors. The result has shape (n_sentence, emb_dim).

    Here is a code example:

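    import numpy as np
    from tensorflow.keras.initializers import Constant
    from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D
    from tensorflow.keras.models import Model

    embedding_dim = 128
    vocab_size = 100
    sentence_len = 20
    
    # random stand-ins for the GloVe matrix and the tokenized sentences
    embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))
    test_sentences = np.random.randint(0, vocab_size, (3, sentence_len))
    
    inp = Input((sentence_len,))  # shape must be a tuple, hence the trailing comma
    # frozen embedding layer initialized with the pretrained matrix
    # (Constant initializer used here in place of weights=[embedding_matrix])
    embedder = Embedding(vocab_size, embedding_dim, trainable=False,
                         embeddings_initializer=Constant(embedding_matrix))(inp)
    avg = GlobalAveragePooling1D()(embedder)  # mean over the token axis
    
    model = Model(inp, avg)
    model.summary()
    
    # the mean of all the word embeddings inside each sentence, shape (3, embedding_dim)
    model(test_sentences)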
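
If the sentence vectors are only needed outside the model, the same average can probably be computed directly in NumPy, which may be convenient for a one-off pass over about 25,000 sentences. The sketch below is only an illustration under stated assumptions: the helper name `mean_sentence_embeddings` and its `pad_id` parameter are hypothetical, and it assumes sentences are padded with index 0, which the original post does not state. With `pad_id` set, padded positions are left out of the mean; without it, the result should match the plain average over the token axis.

```python
import numpy as np

def mean_sentence_embeddings(token_ids, embedding_matrix, pad_id=None):
    """Average the word vectors of each sentence (hypothetical helper).

    token_ids: int array, shape (n_sentence, n_token)
    embedding_matrix: float array, shape (vocab_size, emb_dim)
    pad_id: optional padding index to exclude from the mean (an assumption)
    """
    vectors = embedding_matrix[token_ids]        # (n_sentence, n_token, emb_dim)
    if pad_id is None:
        return vectors.mean(axis=1)              # plain mean over the token axis
    mask = (token_ids != pad_id)[..., None]      # (n_sentence, n_token, 1)
    counts = np.maximum(mask.sum(axis=1), 1)     # avoid dividing by zero
    return (vectors * mask).sum(axis=1) / counts

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1, 1, (100, 128))  # stand-in for the GloVe matrix
sentences = rng.integers(1, 100, (3, 20))          # ids 1..99; 0 is reserved for padding
sentences[0, 15:] = 0                              # pad the tail of the first sentence

plain = mean_sentence_embeddings(sentences, embedding_matrix)
masked = mean_sentence_embeddings(sentences, embedding_matrix, pad_id=0)
print(plain.shape, masked.shape)
```

For sentences without padding the two results agree; they differ only where padded positions would otherwise pull the average toward the padding vector.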