gensim word2vec vocabulary

In Gensim Word2vec, how to reduce the vocab size of an existing model?


In Gensim's word2vec API, I trained a model initialized with max_final_vocab = 100000 and saved it using model.save(). (This gives me one .model file, one .model.trainables.syn1neg.npy file and one .model.wv.vectors.npy file.)
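
For reference, the training and saving step looked roughly like this (gensim 3.x API; sentences stands in for my corpus iterator):

import gensim

# Cap the surviving vocabulary at 100,000 types.
model = gensim.models.Word2Vec(
    sentences,               # placeholder: any iterable of tokenized sentences
    size=300,                # matches the (100000, 300) matrix shown below
    max_final_vocab=100000,
)
model.save("train.fr.model")  # writes the three files listed above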

I do not need to train the model any further, so I'm fine with using just

model = gensim.models.Word2Vec.load("train.fr.model")
kv = model.wv
del model


the kv variable from here on. I now want to use only the top N vocabulary items (N = 40000 in my case) instead of the entire vocabulary. The only way I could find to even attempt cutting down the vocabulary was

import numpy as np

# Load the raw embedding matrix that model.save() wrote to disk.
emb_matrix = np.load("train.fr.model.wv.vectors.npy")
emb_matrix.shape
# (100000, 300)

# Keep only the first 40,000 rows and overwrite the file on disk.
new_emb_matrix = emb_matrix[:40000]
np.save("train.fr.model.wv.vectors.npy", new_emb_matrix)

If I load this model again, though, the vocabulary still has length 100000.
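
(Checking with the gensim 3.x wv.vocab attribute; gensim 4.x would use wv.key_to_index:)

model = gensim.models.Word2Vec.load("train.fr.model")
len(model.wv.vocab)
# 100000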

I want to reduce the vocabulary of the model or model.wv while retaining a working model. Retraining is not an option.


Solution

  • from gensim.models import KeyedVectors
    
    model = KeyedVectors.load_word2vec_format('train.fr.model', limit=1000)
    

    Use the optional limit parameter to reduce the number of vectors that will be loaded (limit=40000 in your case). Note that load_word2vec_format() reads files in the plain word2vec format rather than the output of model.save(), so the vectors may need to be exported to that format first; see the sketch below.
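
    A sketch of the full round trip under that assumption (gensim 3.x API; the train.fr.vec filename is a placeholder):

    from gensim.models import Word2Vec, KeyedVectors

    # Export the trained vectors once to the plain word2vec text format.
    model = Word2Vec.load("train.fr.model")
    model.wv.save_word2vec_format("train.fr.vec")

    # Reload only the first 40,000 vectors. Gensim writes the vocabulary
    # in descending-frequency order, so this keeps the top-N words.
    kv = KeyedVectors.load_word2vec_format("train.fr.vec", limit=40000)
    len(kv.vocab)  # 40000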