In Gensim's Word2Vec API, I trained a model that I initialized with max_final_vocab=100000 and saved with model.save(). (This gives me one .model file, one .model.trainables.syn1neg.npy file, and one .model.wv.vectors.npy file.)
I do not need to train the model any further, so I'm fine with using just the kv variable obtained like this:
model = gensim.models.Word2Vec.load("train.fr.model")
kv = model.wv
del model
I now want to use only the top N (N=40000 in my case) vocabulary items instead of the entire vocabulary. The only way I could find to even attempt cutting down the vocabulary was
import numpy as np
emb_matrix = np.load("train.fr.model.wv.vectors.npy")
emb_matrix.shape
# (100000, 300)
new_emb_matrix = emb_matrix[:40000]
np.save("train.fr.model.wv.vectors.npy", new_emb_matrix)
If I load this model again though, the vocabulary still has length 100000.
I want to reduce the vocabulary of the model or model.wv while retaining a working model. Retraining is not an option.
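from gensim.models import KeyedVectors, Word2Vec
# load_word2vec_format() cannot read files written by model.save(), so export to word2vec format first ("train.fr.w2v.bin" is just an example name)
Word2Vec.load("train.fr.model").wv.save_word2vec_format("train.fr.w2v.bin", binary=True)
kv = KeyedVectors.load_word2vec_format("train.fr.w2v.bin", binary=True, limit=40000)
Use the optional limit parameter to reduce the number of vectors that are loaded from a word2vec-format file. Only the first limit entries are read; for a model trained with the default settings the vocabulary is sorted by descending frequency, so these should be the most frequent words.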
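If you would rather shrink the file itself than pass limit on every load, the word2vec text format is simple enough to truncate by hand: the first line holds the vocabulary size and the vector size, and each following line holds one word and its vector. Below is a minimal sketch under that assumption. The helper truncate_word2vec_text, the file names, and the four-word sample data are made up for illustration and are not part of Gensim; it handles only the text format, not the binary one.

```python
import os
import tempfile


def truncate_word2vec_text(src_path, dst_path, top_n):
    """Copy the first `top_n` vectors of a word2vec text-format file.

    Hypothetical helper, not part of Gensim. Returns the number of
    vectors actually written.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # Header line: "<vocab_size> <vector_size>"
        vocab_size, vector_size = map(int, src.readline().split())
        kept = min(top_n, vocab_size)
        # Write a new header that matches the reduced vocabulary.
        dst.write(f"{kept} {vector_size}\n")
        # Entries come in file order, so the first `kept` lines are copied as-is.
        for _ in range(kept):
            dst.write(src.readline())
    return kept


# Tiny made-up file: 4 words with 2-dimensional vectors.
workdir = tempfile.mkdtemp()
full_path = os.path.join(workdir, "full.txt")
small_path = os.path.join(workdir, "small.txt")
with open(full_path, "w", encoding="utf-8") as f:
    f.write("4 2\nle 0.1 0.2\nde 0.3 0.4\nun 0.5 0.6\nchat 0.7 0.8\n")

kept = truncate_word2vec_text(full_path, small_path, top_n=2)
with open(small_path, encoding="utf-8") as f:
    small_lines = f.read().splitlines()
```

The resulting file can then be loaded with KeyedVectors.load_word2vec_format without a limit argument. This relies on the file listing words in frequency order, which is worth checking on your own export before trusting the cut.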