I'm using gensim to extract a feature vector from a document.
I've downloaded the pre-trained model from Google named GoogleNews-vectors-negative300.bin
and loaded it with the following command:
model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
My purpose is to get a feature vector from a document. For a word, it's very easy to get the corresponding vector:
vector = model[word]
However, I don't know how to do it for a document. Could you please help?
A set of word vectors (such as GoogleNews-vectors-negative300.bin) is neither necessary nor sufficient for the kind of text vectors (Le/Mikolov 'Paragraph Vectors') created by the Doc2Vec class. Instead, Doc2Vec expects to be trained on example texts, from which it learns per-document vectors. The trained model can then also be used to 'infer' vectors for new documents.
(The Doc2Vec class only supports the load_word2vec_format() method because it inherits from the Word2Vec class, not because it needs that functionality.)
There's another simple kind of text vector that can be created by averaging the vectors of all the words in the document, perhaps also weighted by some per-word significance measure (such as TF-IDF). But that's not what Doc2Vec provides.