
How to manually change the vector dimensions of a word in Gensim Word2Vec


I have a Word2Vec model with a lot of word vectors. I can access a word vector like this:

import gensim

word_vectors = gensim.models.Word2Vec.load(wordspace_path)
print(word_vectors['boy'])

Output

[ -5.48055351e-01   1.08748421e-01  -3.50534245e-02  -9.02988110e-03...]

Now I have a proper vector representation with which I want to replace word_vectors['boy']:

word_vectors['boy'] = [ -7.48055351e-01   3.08748421e-01  -2.50534245e-02  -10.02988110e-03...]

But the following error is raised:

TypeError: 'Word2Vec' object does not support item assignment

Is there any way or workaround to do this, that is, to manipulate word vectors manually once the model is trained? Is it possible on platforms other than Gensim?


Solution

  • Since word2vec vectors are typically created only by the iterative training process and then read, the gensim Word2Vec object does not support direct assignment of new values by word key.

    However, because it is written in Python, all of its internal structures are fully viewable and modifiable by you. And because it is open-source, you can see exactly how it implements its existing functionality and use that as a model for doing new things.

    Specifically, the raw word-vectors are (in recent versions of gensim) stored in a property of the Word2Vec object called wv, and this wv property is an instance of KeyedVectors. If you examine its source code, you can see that accesses of word-vectors by string key (e.g. 'boy'), including []-indexing as implemented by the __getitem__() method, go through its word_vec() method. You can view the source of that method either in your local installation or on GitHub:

    https://github.com/RaRe-Technologies/gensim/blob/c2201664d5ae03af8d90fb5ff514ffa48a6f305a/gensim/models/keyedvectors.py#L265

    There you'll see that the word is first converted to an integer index (via self.vocab[word].index), which is then used to access an internal syn0 or syn0norm array (depending on whether the user is requesting the raw or the unit-normalized vector). If you look at where these arrays are set up, or simply examine them in your own console or code (for example, word_vectors.wv.syn0), you'll see that they are numpy arrays, which do support direct assignment by index.

    So, you can directly change their values by integer index, for example:

    word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index] = [ -7.48055351e-01   3.08748421e-01  -2.50534245e-02  -10.02988110e-03...]
    

    And then, future accesses of word_vectors.wv['boy'] will return your updated values.
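
    The lookup-then-assign mechanism described above can be sketched without a trained model. This is only an illustration under stated assumptions: vocab_index and syn0 below are hypothetical toy stand-ins for wv.vocab[word].index and wv.syn0, and lookup() merely imitates what word_vec() does. In gensim 4.x the same data is exposed under different names (wv.vectors and wv.key_to_index), so adapt the attribute names to your version.

```python
import numpy as np

# Toy stand-ins (hypothetical data, not gensim itself):
# vocab_index plays the role of wv.vocab[word].index,
# syn0 plays the role of the raw vector array wv.syn0.
vocab_index = {"boy": 0, "girl": 1, "dog": 2}
syn0 = np.array(
    [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]],
    dtype=np.float32,
)


def lookup(word):
    """Imitate the string-key lookup: word -> integer index -> array row."""
    return syn0[vocab_index[word]]


# The replacement must have the same dimensionality as the existing rows.
new_vector = np.array([-0.7, 0.3, -0.02, -0.01], dtype=np.float32)
assert new_vector.shape == (syn0.shape[1],)

# Assigning by integer row index overwrites that row in place.
syn0[vocab_index["boy"]] = new_vector

print(lookup("boy"))
```

    Because the array is modified in place, every later lookup of 'boy' sees the new values, while the rows for other words are untouched.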

    Notes:

    • If you want syn0norm to be updated, to have the proper unit-normed vectors (as are used in most_similar() and other operations), it'd likely be best to modify syn0 first, then discard and recalculate syn0norm, via:

    word_vectors.wv.syn0norm = None
    word_vectors.wv.init_sims()
    

    • Adding new words would require more involved object-tampering, because it requires growing syn0 (replacing it with a larger array) and updating the vocab dict.
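
    Both notes can be illustrated with a rough numpy-only sketch. Everything here is an assumption for illustration: vocab_index and syn0 are hypothetical stand-ins for the model's vocab dict and raw array, and unit_normalize() and add_word() are helper names invented for this example, not gensim functions. A real model keeps further bookkeeping (such as its index-to-word list) that this sketch leaves out.

```python
import numpy as np

# Hypothetical stand-ins for wv.vocab (word -> row index) and wv.syn0 (raw vectors).
vocab_index = {"boy": 0, "girl": 1}
syn0 = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)


def unit_normalize(matrix):
    """Return a copy of matrix with every row scaled to unit L2 length."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


def add_word(word, vector, vocab_index, matrix):
    """Register a new word and return a larger array with its vector appended."""
    if word in vocab_index:
        raise KeyError(f"{word!r} is already in the vocabulary")
    vector = np.asarray(vector, dtype=matrix.dtype)
    if vector.shape != (matrix.shape[1],):
        raise ValueError("new vector must match the existing dimensionality")
    vocab_index[word] = matrix.shape[0]  # the new row goes at the end
    return np.vstack([matrix, vector])


# Growing the array means replacing it with a new, larger one.
syn0 = add_word("dog", [0.0, 2.0], vocab_index, syn0)

# The normalized copy is derived data, so rebuild it after any change to syn0.
syn0norm = unit_normalize(syn0)

print(syn0.shape)
print(syn0norm[vocab_index["dog"]])
```

    Rebuilding the normalized array from scratch, rather than patching single rows, mirrors the discard-and-recalculate approach in the first note and avoids the two arrays drifting out of sync.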