I have a Word2Vec model with a lot of word vectors. I can access a word vector like so:
import gensim

# wordspace_path is the path to the previously saved model
word_vectors = gensim.models.Word2Vec.load(wordspace_path)
print(word_vectors['boy'])
Output
[ -5.48055351e-01 1.08748421e-01 -3.50534245e-02 -9.02988110e-03...]
Now I have a better vector representation that I want to replace word_vectors['boy'] with.
word_vectors['boy'] = [-7.48055351e-01, 3.08748421e-01, -2.50534245e-02, -10.02988110e-03, ...]
But the following error is thrown:
TypeError: 'Word2Vec' object does not support item assignment
Is there any way or workaround to do this, that is, to manipulate word vectors manually once the model is trained? Is it possible on platforms other than Gensim?
Since word2vec vectors are typically only created by the iterative training process, then accessed, the gensim Word2Vec object does not support direct assignment of new values by its word indexes.
However, as it is in Python, all its internal structures are fully viewable/tamperable by you, and as it is open-source, you can view exactly how it does all of its existing functionality, and use that as a model for how to do new things.
Specifically, the raw word-vectors are (in recent versions of gensim) stored in a property of the Word2Vec object called wv, and this wv property is an instance of KeyedVectors. If you examine its source code, you can see that accesses of word-vectors by string key (e.g. 'boy'), including those by []-indexing implemented by the __getitem__() method, go through its method word_vec(). You can view the source of that method either in your local installation or on GitHub.
There you'll see the word is actually converted to an integer-index (via self.vocab[word].index) then used to access an internal syn0 or syn0norm array (depending on whether the user is accessing the raw or unit-normalized vector). If you look elsewhere where these are set up, or simply examine them in your own console/code (as if by word_vectors.wv.syn0), you'll see these are numpy arrays which do support direct assignment by index.
So, you can directly tamper with their values by integer index, as if by:
word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index] = [-7.48055351e-01, 3.08748421e-01, -2.50534245e-02, -10.02988110e-03, ...]
And then, future accesses of word_vectors.wv['boy'] will return your updated values.
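To try the index-based assignment without a trained model, here is a minimal sketch using a stand-in for the two structures described above: a vocab dict mapping each word to an entry with an integer index, and a syn0 numpy array with one row per word. The names TinyVectors and VocabEntry are hypothetical and are not gensim classes; they only mimic the layout. (In gensim 4.x the corresponding attributes are, as far as I know, wv.vectors and wv.key_to_index, so the attribute names in this answer apply to older releases.)

```python
import numpy as np


class VocabEntry:
    """Hypothetical stand-in for a vocab entry: holds only the row index."""

    def __init__(self, index):
        self.index = index


class TinyVectors:
    """Hypothetical stand-in mimicking the layout described above (not gensim code)."""

    def __init__(self, words, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab = {w: VocabEntry(i) for i, w in enumerate(words)}
        self.syn0 = rng.normal(size=(len(words), dim)).astype(np.float32)

    def __getitem__(self, word):
        # Lookup by string key: word -> integer index -> row of the array.
        return self.syn0[self.vocab[word].index]


wv = TinyVectors(["boy", "girl", "dog"], dim=4)
new_vector = np.array([-0.75, 0.31, -0.025, -0.01], dtype=np.float32)

# Assign into the underlying array by integer index, not by word key.
wv.syn0[wv.vocab["boy"].index] = new_vector

print(wv["boy"])
```

The replacement vector must have the same dimensionality as the existing rows, otherwise numpy will raise an error on assignment.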
Notes:
• If you want syn0norm to be updated, to have the proper unit-normed vectors (as are used in most_similar() and other operations), it'd likely be best to modify syn0 first, then discard and recalculate syn0norm, via:
word_vectors.wv.syn0norm = None
word_vectors.wv.init_sims()
• Adding new words would require more involved object-tampering, because it will require growing the syn0 (replacing it with a larger array), and updating the vocab dict.
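Both notes can be illustrated on plain numpy data. The sketch below is not gensim code: unit_normalize and add_word are hypothetical helpers, index_of is a simplified word-to-index dict standing in for vocab, and syn0 is a toy array. A real model keeps additional per-word state that this sketch ignores, so treat it only as an outline of the array manipulation involved.

```python
import numpy as np


def unit_normalize(matrix):
    """Return a copy of `matrix` with every row scaled to unit L2 length."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


def add_word(index_of, vectors, word, vector):
    """Append `vector` as a new row and record its index under `word`.

    Returns the new, larger array; the original array is left unchanged.
    """
    if word in index_of:
        raise KeyError(f"{word!r} is already in the vocabulary")
    vector = np.asarray(vector, dtype=vectors.dtype)
    if vector.shape != (vectors.shape[1],):
        raise ValueError("new vector has the wrong dimensionality")
    grown = np.vstack([vectors, vector[np.newaxis, :]])
    index_of[word] = len(vectors)  # the new row sits at the old row count
    return grown


# Toy data standing in for vocab and syn0.
index_of = {"boy": 0, "girl": 1}
syn0 = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)

# Grow the array by one row, then recompute the normalized copy from scratch.
syn0 = add_word(index_of, syn0, "dog", [0.0, 2.0])
syn0norm = unit_normalize(syn0)
```

Recomputing the normalized array from the raw one after every change, rather than patching single rows, keeps the two arrays from drifting apart.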