python machine-learning nlp word2vec

Shape of my dataframe (#rows) and that of the final embeddings array don't match


I generated the word embeddings for my corpus (a 2-D list), then tried to compute the average Word2Vec embedding for each individual word list inside my corpus (that is, for each comment, which has been converted into a list via the split() method). But the final length of my average Word2Vec embeddings NumPy array doesn't match the number of rows, i.e. 159571, which is the number of comments.

Here's the code that generates the final_embeddings array:

#Building vocabulary
vocabulary = set(model.wv.index_to_key)

final_embeddings = []
for i in flatten_corpus:
    avg_embeddings = None
    for j in i:
        if j in vocabulary:
            if avg_embeddings is None:
                avg_embeddings = model.wv[j]
            else:
                avg_embeddings = avg_embeddings + model.wv[j]
    if avg_embeddings is not None:
        avg_embeddings = avg_embeddings / len(avg_embeddings)
        final_embeddings.append(avg_embeddings)

What am I doing wrong?


Solution

  • You are only appending to final_embeddings in a code branch that's only sometimes reached: when there's at least one known word in the text.

    If any element of flatten_corpus only includes words that aren't in the model, it will simply proceed to the next item in flatten_corpus.

    And then, you'll not only be missing those 84 items, but the average vectors in final_embeddings will no longer be aligned at the same slot indexes as their matching texts.

    A quick-and-dirty fix would be to initialize avg_embeddings to a value that stands in as the default even if none of the words are known. For example (assuming numpy is imported as np):

        avg_embeddings = np.zeros(model.vector_size, dtype=np.float32)
    

    Of course, having 84 of your per-text summary average vectors be zero vectors may cause other problems down the line, so you may want to think more about what, if anything, you should do for such texts. Maybe, without word vectors to model them, they should just be ignored.
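
    Putting that fallback into the original loop, a minimal sketch (assuming the flatten_corpus, model, and vocabulary from the question) could look like this. As a side note, it divides by the number of in-vocabulary words rather than by len(avg_embeddings), which is the vector dimensionality, so the result is a true per-text average:

        import numpy as np

        final_embeddings = []
        for words in flatten_corpus:
            total = np.zeros(model.vector_size, dtype=np.float32)
            count = 0
            for word in words:
                if word in vocabulary:
                    total += model.wv[word]
                    count += 1
            # Divide by the word count when possible; otherwise keep the zero
            # vector so every comment still gets a slot in final_embeddings.
            final_embeddings.append(total / count if count else total)

        final_embeddings = np.array(final_embeddings)
        print(final_embeddings.shape)   # expected: (159571, model.vector_size)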

    Separately, recent gensim versions provide a helper that makes this kind of code shorter and easier to debug: KeyedVectors.get_mean_vector() computes the average vector for a list of words, so the whole loop can be reduced to:

        final_embeddings = [model.wv.get_mean_vector(text) for text in flatten_corpus]
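
    One caveat: get_mean_vector raises a ValueError when none of a text's words are in the model (at least in recent gensim versions), so for those 84 texts you may still want a fallback to keep the list aligned. For example (a sketch, reusing the names from the question):

        import numpy as np

        final_embeddings = []
        for text in flatten_corpus:
            try:
                final_embeddings.append(model.wv.get_mean_vector(text))
            except ValueError:
                # No known words in this text: keep its slot with a zero vector.
                final_embeddings.append(np.zeros(model.vector_size, dtype=np.float32))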