I am looking for information on why Doc2vec returns different results each time it runs. I have seen many previous questions about this, and I know it happens because the vectors are randomly initialized. However, I am building a website that displays these results in the frontend, and the variation between runs reduces the reliability of the system.
I know my dataset is really small, but infer_vector() does not return the same vector for the same document, and the results of most_similar() differ on each run. How can I prevent this, or is there an alternative way to apply a Doc2vec model in my application that avoids these differences?
Here is some of my code:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, dm=1, window=5, min_count=2, epochs=100, negative=0, workers=5)
But I received a warning: You must set either 'hs' or 'negative' to be positive for proper training. When both 'hs=0' and 'negative=0', there will be no training.
I tried setting negative=-1, but according to the explanation from gensim, negative must be an integer.
These are potentially two different issues.
With regard to the warning you're seeing:
You must set either 'hs' or 'negative' to be positive
for proper training. When both 'hs=0' and 'negative=0',
there will be no training.
The warning is complete and truthful: it already describes what you are doing wrong and how to fix it.
You must set either hs or negative to be positive, or else no training will happen in your model. negative=-1 is an illegal setting, and not positive.
If you want to use Doc2Vec, you need either to set the negative parameter to a positive integer (as with its default value, negative=5), or, if you want to set negative=0, to enable the alternative "hierarchical softmax" mode with hs=1.
The algorithm will either raise an error or give nonsense, untrained results if you give it an illegal configuration.
As explained in Q12 of the Gensim Project FAQ and in other StackOverflow answers, the operation of the Doc2Vec algorithm naturally allows for variance in the vectors returned by infer_vector() from run to run.
And if that "jitter" between inferences is making a big difference in results, there are probably other serious problems in your use of Doc2Vec, such as insufficient data or bad parameters, that you should fix rather than trying to force a false determinism onto your calculations.
In particular, if the model whose infer_vector() results keep changing was "trained" – not really – with the parameters shown (negative=0 without hs enabled), ignoring the warning that this won't work, that is the first big problem to solve. It will make all inferred vectors random and meaningless (as opposed to just "a little noisy").
But, if after fixing the total failure of training you then insistently want to do the incorrect thing, you can force inference determinism as is described in another answer at: