python, pca, gensim, word2vec

Python word2vec updates


I am trying to convert this old snippet of code to be in line with the updated version of gensim. I was able to convert the model.wv.vocab to model.wv.key_to_index but am having issues with the model[model.wv.vocab] and how to convert that.

Here is what the code looks like:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import pandas as pd

model = Word2Vec(corpus, min_count=1, vector_size=5)

#pass the embeddings to PCA
X = model[model.wv.vocab]

pca = PCA(n_components=2)
result = pca.fit_transform(X)

#create df from the pca results
pca_df = pd.DataFrame(result, columns = ['x','y'])

I have tried this:

#pass the embeddings to PCA
X = model.wv.key_to_index
pca = PCA(n_components=2)
result = pca.fit_transform(X)

#create df from the pca results
pca_df = pd.DataFrame(result, columns = ['x','y'])

and keep getting errors. Here is what model.wv.key_to_index looks like:

{'the': 0,
 'in': 1,
 'of': 2,
 'on': 3,
 '': 4,
 'and': 5,
 'a': 6,
 'to': 7,
 'were': 8,
 'forces': 9,
 'by': 10,
 'was': 11,
 'at': 12,
 'against': 13,
 'for': 14,
 'protest': 15,
 'with': 16,
 'an': 17,
 'as': 18,
 'police': 19,
 'killed': 20,
 'district': 21,
 'city': 22,
 'people': 23,
 'al': 24,
 ...
 'came': 996,
 'donbass': 997,
 'resulting': 998,
 'financial': 999}

Solution

  • Your question doesn't show which errors you keep getting, which would usually help identify what's going wrong.

    However, it looks like your original (circa Gensim 3.x) code's line…

    X = model[model.wv.vocab]
    

    …intends to assemble (per the scikit-learn PCA API) an array-like of shape (n_samples, n_features), by looking up every key in model.wv.vocab in the model, so that each row of the new array is one of the vocab word-vectors.
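
    That also explains the errors from your attempt: model.wv.key_to_index is a dict mapping words to integer positions, not an array of vectors, so it's not something PCA can fit. If you did want to reproduce the old per-key lookup explicitly under Gensim 4.x, a minimal sketch, reusing your model, might be:

    import numpy as np

    # look up each vocab key's vector and stack the results into a new
    # (n_vocab_words, vector_size) array, mirroring the old line's intent
    X = np.vstack([model.wv[key] for key in model.wv.key_to_index])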

    The most direct replacement for that line would thus be to just use the model's existing internal array of word-vectors:

    X = model.wv.vectors
    

    That is: for this use, you don't need to look up words individually, or create a new array of results. The existing in-model array is already exactly what you need.
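
    For context, an end-to-end sketch of the updated pipeline might look like this (a minimal sketch assuming Gensim 4.x, scikit-learn, and pandas; the tiny corpus is just a stand-in for your real tokenized sentences):

    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA
    import pandas as pd

    # stand-in corpus: any iterable of tokenized sentences works here
    corpus = [
        ['police', 'forces', 'were', 'at', 'the', 'protest'],
        ['people', 'in', 'the', 'city', 'came', 'to', 'the', 'district'],
    ]

    model = Word2Vec(corpus, min_count=1, vector_size=5)

    # the model's existing internal array of word-vectors: shape
    # (n_vocab_words, vector_size), rows ordered per model.wv.index_to_key
    X = model.wv.vectors

    pca = PCA(n_components=2)
    result = pca.fit_transform(X)

    # create df from the PCA results, with one labeled row per vocab word
    pca_df = pd.DataFrame(result, columns=['x', 'y'])
    pca_df['word'] = model.wv.index_to_key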

    Of course, if you instead want to use subsets of words, you might want to look up specific words individually (a sketch of that follows the note below). Still, for the specific case of using, say, the first 10 words (as in your sibling answer), you could also just use a NumPy array 'view' on the existing array, accessed via Python slice notation:

    first_ten_word_vectors = model.wv.vectors[:10]
    

    (As these models typically front-load the storage order with the most-frequent words, and the "long-tail" of less-frequent words has worse vectors and less utility, working with just the top-N words, while ignoring the rest, often improves overall resource usage and evaluated performance. More isn't always better, when that 'more' is in the less-informative 'noise' of your training data or later texts.)
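
    If you instead want a hand-picked subset rather than a frequency-ordered prefix, you can also index the KeyedVectors with a list of keys, which returns a 2D array with one row per word. A small sketch, reusing the model built above (the specific words are just illustrative picks from your vocab dump):

    # top-10 most-frequent words and their vectors, in matching order
    top_words = model.wv.index_to_key[:10]
    top_vectors = model.wv.vectors[:10]

    # look up an arbitrary subset of words in one call
    subset_words = ['police', 'protest', 'city']
    subset_vectors = model.wv[subset_words]

    assert subset_vectors.shape == (len(subset_words), model.wv.vector_size)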

    Two other unrelated notes on your example code – which I recognize may just be a toy demo of some larger exercise, but are still useful to remember: