[SOLVED] How to calculate the mean of the GloVe word embeddings in a sentence

I have downloaded the pretrained GloVe matrix and used it in a Keras layer. However, I need sentence embeddings for another task.

I want to calculate the mean of all the word embeddings in each sentence.

What is the most efficient way to do that, given that there are about 25,000 sentences?

Also, I don't want to use a Lambda layer in Keras to compute the mean.

Solution

The simplest way to do this is to use a GlobalAveragePooling1D layer. It receives the token embeddings from the Embedding layer as a tensor of shape (n_sentence, n_token, emb_dim) and averages over the token axis, so each sentence is reduced to the mean of its word vectors. The result has shape (n_sentence, emb_dim).
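To make the shapes concrete, here is a minimal NumPy sketch of the same computation outside Keras. The sizes and the random matrix are toy values assumed for illustration only; with real data, embedding_matrix would be the GloVe matrix and token_ids the integer-encoded sentences.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, emb_dim, n_token = 10, 4, 5

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1, 1, (vocab_size, emb_dim))  # stand-in for GloVe
token_ids = rng.integers(0, vocab_size, (3, n_token))         # 3 encoded sentences

# Embedding lookup: (n_sentence, n_token) -> (n_sentence, n_token, emb_dim)
token_vectors = embedding_matrix[token_ids]

# Average over the token axis: -> (n_sentence, emb_dim)
sentence_vectors = token_vectors.mean(axis=1)
print(sentence_vectors.shape)
```

For a one-off computation over a few tens of thousands of sentences, this vectorized lookup and mean may be all that is needed; the Keras layer is the better fit when the average has to be part of a model.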

Here is a Keras code example:

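import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Model

embedding_dim = 128
vocab_size = 100
sentence_len = 20

# random stand-ins for the GloVe matrix and the integer-encoded sentences
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))
test_sentences = np.random.randint(0, vocab_size, (3, sentence_len))

inp = Input((sentence_len,))  # the shape must be a tuple, hence the trailing comma
# frozen embedding layer initialized from the pretrained matrix
embedder = Embedding(vocab_size, embedding_dim, trainable=False,
                     embeddings_initializer=Constant(embedding_matrix))(inp)
avg = GlobalAveragePooling1D()(embedder)  # mean over the token axis

model = Model(inp, avg)
model.summary()

model(test_sentences)  # shape (3, embedding_dim): the mean word embedding of each sentence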
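One caveat: the plain average counts every position, so if shorter sentences are padded up to sentence_len, the padding vectors are averaged in too. The sketch below shows a masked mean in NumPy, assuming (this is not stated in the question) that index 0 is reserved for padding. The helper name masked_mean and the pad_id parameter are hypothetical. In Keras, passing mask_zero=True to the Embedding layer should have a similar effect, since GlobalAveragePooling1D accepts a mask and excludes masked steps from the average.

```python
import numpy as np

def masked_mean(embedding_matrix, token_ids, pad_id=0):
    """Mean word vector per sentence, skipping positions equal to pad_id."""
    vectors = embedding_matrix[token_ids]         # (n_sentence, n_token, emb_dim)
    mask = (token_ids != pad_id)[..., None]       # (n_sentence, n_token, 1)
    counts = np.maximum(mask.sum(axis=1), 1)      # (n_sentence, 1); avoids division by zero
    return (vectors * mask).sum(axis=1) / counts  # (n_sentence, emb_dim)

# Toy matrix: 4 "words" of dimension 3; row 0 is assumed to be the padding row.
embedding_matrix = np.arange(12, dtype=float).reshape(4, 3)
token_ids = np.array([[1, 2, 0, 0],   # two real tokens followed by two pads
                      [3, 3, 3, 3]])  # no padding
result = masked_mean(embedding_matrix, token_ids)
print(result)  # [[4.5 5.5 6.5], [9. 10. 11.]]
```

The first sentence is averaged over its two real tokens only, which is usually what is wanted for sentence embeddings of variable-length text.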