python, pca, gensim, word2vec

Python word2vec updates


I am trying to convert this old snippet of code to be in line with the updated version of gensim. I was able to convert the model.wv.vocab to model.wv.key_to_index but am having issues with the model[model.wv.vocab] and how to convert that.

Here is what the code looks like:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import pandas as pd

model = Word2Vec(corpus, min_count=1, vector_size=5)

#pass the embeddings to PCA
X = model[model.wv.vocab]

pca = PCA(n_components=2)
result = pca.fit_transform(X)

#create df from the pca results
pca_df = pd.DataFrame(result, columns = ['x','y'])

I have tried this:

#pass the embeddings to PCA
X = model.wv.key_to_index
pca = PCA(n_components=2)
result = pca.fit_transform(X)

#create df from the pca results
pca_df = pd.DataFrame(result, columns = ['x','y'])

and keep getting errors. Here is what model.wv.key_to_index looks like:

{'the': 0,
 'in': 1,
 'of': 2,
 'on': 3,
 '': 4,
 'and': 5,
 'a': 6,
 'to': 7,
 'were': 8,
 'forces': 9,
 'by': 10,
 'was': 11,
 'at': 12,
 'against': 13,
 'for': 14,
 'protest': 15,
 'with': 16,
 'an': 17,
 'as': 18,
 'police': 19,
 'killed': 20,
 'district': 21,
 'city': 22,
 'people': 23,
 'al': 24,
 ...
 'came': 996,
 'donbass': 997,
 'resulting': 998,
 'financial': 999}

Solution

  • Your question doesn't show which errors you keep getting, which would usually help identify what's going wrong.

    However, it looks like your original (circa Gensim 3.x) code's line…

    X = model[model.wv.vocab]
    

    …intends to assemble (per the scikit-learn PCA API) an array-like of shape (n_samples, n_features), by looking up every key in model.wv.vocab in the model, so that each row of the new array is one of the vocab word-vectors.
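
    That also explains the errors from your attempt: model.wv.key_to_index is a dict mapping words to integer positions, not an array of vectors, so it's not something PCA can fit. If you did want to reproduce the old per-key lookup explicitly under Gensim 4.x, a minimal sketch, reusing your model, might be:

    import numpy as np

    # look up each vocab key's vector and stack the results into a new
    # (n_vocab_words, vector_size) array, mirroring the old line's intent
    X = np.vstack([model.wv[key] for key in model.wv.key_to_index])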

    The most direct replacement for that line would thus be to just use the model's existing internal array of word-vectors:

    X = model.wv.vectors
    

    That is: for this use, you don't need to look up words individually, or create a new array of results. The existing in-model array is already exactly what you need.
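
    For context, an end-to-end sketch of the updated pipeline might look like this (a minimal sketch assuming Gensim 4.x, scikit-learn, and pandas; the tiny corpus is just a stand-in for your real tokenized sentences):

    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA
    import pandas as pd

    # stand-in corpus: any iterable of tokenized sentences works here
    corpus = [
        ['police', 'forces', 'were', 'at', 'the', 'protest'],
        ['people', 'in', 'the', 'city', 'came', 'to', 'the', 'district'],
    ]

    model = Word2Vec(corpus, min_count=1, vector_size=5)

    # the model's existing internal array of word-vectors: shape
    # (n_vocab_words, vector_size), rows ordered per model.wv.index_to_key
    X = model.wv.vectors

    pca = PCA(n_components=2)
    result = pca.fit_transform(X)

    # create df from the PCA results, with one labeled row per vocab word
    pca_df = pd.DataFrame(result, columns=['x', 'y'])
    pca_df['word'] = model.wv.index_to_key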

    Of course, if you instead want to use subsets of words, you might want to look up specific words individually (a sketch of that follows the note below). Still, for the specific case of using, say, the first 10 words (as in your sibling answer), you could also just use a NumPy array 'view' on the existing array, accessed via Python slice notation:

    first_ten_word_vectors = model.wv.vectors[:10]
    

    (As these models typically front-load the storage order with the most-frequent words, and the "long-tail" of less-frequent words has worse vectors and less utility, working with just the top-N words, while ignoring the rest, often improves overall resource usage and evaluated performance. More isn't always better, when that 'more' is in the less-informative 'noise' of your training data or later texts.)
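
    If you instead want a hand-picked subset rather than a frequency-ordered prefix, you can also index the KeyedVectors with a list of keys, which returns a 2D array with one row per word. A small sketch, reusing the model built above (the specific words are just illustrative picks from your vocab dump):

    # top-10 most-frequent words and their vectors, in matching order
    top_words = model.wv.index_to_key[:10]
    top_vectors = model.wv.vectors[:10]

    # look up an arbitrary subset of words in one call
    subset_words = ['police', 'protest', 'city']
    subset_vectors = model.wv[subset_words]

    assert subset_vectors.shape == (len(subset_words), model.wv.vector_size)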

    Two other unrelated notes on your example code – which I recognize may just be a toy demo of some larger exercise, but are still useful to remember: