Gensim's official tutorial explicitly states that it is possible to continue training a (loaded) model. I'm aware that, according to the documentation, it is not possible to continue training a model that was loaded from the word2vec format. But even when one generates a model from scratch and then tries to call the train method, it is not possible to access the newly created labels for the LabeledSentence instances supplied to train.
>>> from gensim.models.doc2vec import Doc2Vec, LabeledSentence
>>> sentences = [LabeledSentence(['first', 'sentence'], ['SENT_0']), LabeledSentence(['second', 'sentence'], ['SENT_1'])]
>>> model = Doc2Vec(sentences, min_count=1)
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
>>> sentence = LabeledSentence(['third', 'sentence'], ['SENT_2'])
>>> model.train([sentence])
>>> print(model.vocab.keys())
# At this point I would expect the key 'SENT_2' to be present in the vocabulary, but it isn't
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
Is it at all possible to continue the training of a Doc2Vec model in Gensim with new sentences? If so, how can this be achieved?
My understanding is that this is not possible for any new labels. We can only continue training when the new data has the same labels as the old data. As a result, we are training or retuning the weights of the already learned vocabulary, but we are not able to learn a new vocabulary.
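To make that behaviour concrete, here is a minimal toy sketch in plain Python. It is not Gensim's implementation: ToyEmbeddingModel, its train method, and the averaging update are all hypothetical stand-ins, chosen only to show a vocabulary that is fixed when the model is built, so that later training adjusts existing vectors and silently skips unseen keys.

```python
import random


class ToyEmbeddingModel:
    """Hypothetical stand-in (not Gensim): vocabulary is fixed at construction."""

    def __init__(self, labeled_sentences, dim=4, seed=0):
        rng = random.Random(seed)
        self.vocab = {}
        # The vocabulary is built exactly once, from the initial corpus.
        for words, labels in labeled_sentences:
            for key in list(words) + list(labels):
                if key not in self.vocab:
                    self.vocab[key] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    def train(self, labeled_sentences, lr=0.1):
        """Update vectors of known keys only; return how many keys were skipped."""
        skipped = 0
        for words, labels in labeled_sentences:
            keys = list(words) + list(labels)
            known = [k for k in keys if k in self.vocab]
            skipped += len(keys) - len(known)
            if len(known) < 2:
                continue  # nothing to relate to each other
            dim = len(self.vocab[known[0]])
            mean = [sum(self.vocab[k][i] for k in known) / len(known)
                    for i in range(dim)]
            # Toy update: pull each known vector slightly toward the mean.
            for k in known:
                self.vocab[k] = [v + lr * (m - v)
                                 for v, m in zip(self.vocab[k], mean)]
        return skipped


model = ToyEmbeddingModel([
    (['first', 'sentence'], ['SENT_0']),
    (['second', 'sentence'], ['SENT_1']),
])

# New label and new word: both are skipped, the vocabulary does not grow.
skipped = model.train([(['third', 'sentence'], ['SENT_2'])])

# Old label and old words: the existing vectors are retuned.
before = list(model.vocab['sentence'])
model.train([(['first', 'sentence'], ['SENT_0'])])
after = model.vocab['sentence']
```

In this sketch, 'SENT_2' and 'third' never enter model.vocab, while the vector for 'sentence' changes once training data with known labels is supplied, which mirrors what the question observes.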
There is a similar question for adding new labels/words/sentences during training: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ
Also, you might want to keep an eye on this discussion: https://groups.google.com/forum/#!topic/gensim/UZDkfKwe9VI
Update: If you want to add new words to an already trained model, take a look at online word2vec here: https://rutumulkar.com/ml-notes/word2vec/representation%20learning/2015/08/22/word2vec.html