I generated word embeddings for my corpus (a 2-D list), then tried to compute the average Word2Vec embedding for each individual word list (that is, for each comment, which had been converted into a list via the split() method). However, the final length of my average-Word2Vec NumPy array doesn't match the number of rows, i.e. 159571, which is the number of comments.
Here's the code for generating the 'final_embeddings' array:
# Building vocabulary
vocabulary = set(model.wv.index_to_key)

final_embeddings = []
for i in flatten_corpus:
    avg_embeddings = None
    for j in i:
        if j in vocabulary:
            if avg_embeddings is None:
                avg_embeddings = model.wv[j]
            else:
                avg_embeddings = avg_embeddings + model.wv[j]
    if avg_embeddings is not None:
        avg_embeddings = avg_embeddings / len(avg_embeddings)
        final_embeddings.append(avg_embeddings)
What am I doing wrong?
You are only appending to final_embeddings in a code branch that's only sometimes reached: when there's at least one known word in the text. If an element of flatten_corpus only includes words that aren't in the model, the loop simply proceeds to the next item in flatten_corpus.
And then you'll not only be missing those 84 items: the average vectors in final_embeddings will no longer be aligned at the same slot indexes as their matching texts.
A quick-and-dirty fix would be to initialize your avg_embeddings to some stand-in default value, so that something is appended even if none of the words are known. For example:

avg_embeddings = np.zeros(model.vector_size, dtype=np.float32)
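Folded back into your original loop, that fix might look like this (a minimal sketch, assuming the same flatten_corpus and model as in your question, and dividing by the count of known words rather than the vector's own length to get a true average):

import numpy as np

final_embeddings = []
for text in flatten_corpus:
    # Start from a zero vector so every text contributes one entry,
    # even when none of its words are in the model.
    avg_embeddings = np.zeros(model.vector_size, dtype=np.float32)
    known_words = 0
    for word in text:
        if word in model.wv:
            avg_embeddings += model.wv[word]
            known_words += 1
    if known_words > 0:
        # Average over the number of known words, not the vector's length.
        avg_embeddings /= known_words
    final_embeddings.append(avg_embeddings)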
Of course, having 84 of your per-text average vectors be zero vectors may cause other problems down the line, so you may want to think more about what, if anything, you should do for such texts. Maybe, without any word-vectors to model them, they should just be skipped entirely.
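If you do choose to skip them, one option is to record which positions survived, so the remaining vectors stay aligned with their source comments (a sketch; kept_indices is an illustrative name, not something from your code):

final_embeddings = []
kept_indices = []  # positions in flatten_corpus that produced a vector
for idx, text in enumerate(flatten_corpus):
    known = [model.wv[word] for word in text if word in model.wv]
    if not known:
        continue  # no known words: drop this text entirely
    kept_indices.append(idx)
    final_embeddings.append(sum(known) / len(known))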
Other notes on making code that is easier to debug:
- using descriptive temporary variable names like 'text' & 'word' instead of 'i' & 'j' makes code clearer
- you can already test whether a word is inside a set of word-vectors (model.wv, of Gensim class type KeyedVectors) with idiomatic Python membership-checking, so there's no need to create your vocabulary set. Instead, just check with: if word in model.wv:
- the KeyedVectors object has a utility method, .get_mean_vector(), for getting the average of the word-vectors of a list-of-words, with other options that could prove helpful. If you combine that with a Python list comprehension, your code can be replaced by a one-liner:

final_embeddings = [model.wv.get_mean_vector(text) for text in flatten_corpus]
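As a quick sanity check (a small sketch, assuming NumPy is available), you can then confirm the result lines up with your comment count:

import numpy as np

final_embeddings = np.array(final_embeddings)
print(final_embeddings.shape)  # expect (159571, model.vector_size)

(If any of your texts contain no known words at all, you may still need to handle them before this step, per the discussion above.)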