I'm using a Doc2Vec model. I pre-trained the model on a dataset of more than 20K Wikipedia articles. After that, I tried to test the result by calculating the similarity between two sentences.
I have two sentences:
1. The process of searching for a job can be very stressful.
2. The job search process can be very stressful.
After preprocessing and tokenizing, the word list for sentence 1 is list_1 = ['process', 'search', 'job', 'stress'] and the word list for sentence 2 is list_2 = ['job', 'search', 'process', 'stress'].
But after I compute vec_1 = doc2vec_model.infer_vector(list_1) and vec_2 = doc2vec_model.infer_vector(list_2), and then use gensim.matutils.full2sparse and gensim.matutils.cossim to calculate the cosine similarity, I get a result near 0, like 0.00709335870.
That doesn't seem right. I think the result should be near 1.
What is the problem, and how do I fix it?
This is part of my code:
# model.tokenize_word(data['document_1']) is ['process', 'search', 'job', 'stress']
vec_1 = doc2vec_model.infer_vector(model.tokenize_word(data['document_1']))
doc2vec_model.random.seed(0)
# model.tokenize_word(data['document_2']) is ['job', 'search', 'process', 'stress']
vec_2 = doc2vec_model.infer_vector(model.tokenize_word(data['document_2']))
vec_1 = gensim.matutils.full2sparse(vec_1)
vec_2 = gensim.matutils.full2sparse(vec_2)
similarity = gensim.matutils.cossim(vec_1, vec_2)
print(similarity)  # 0.00709335870
You haven't shown how you ran your Doc2Vec training; something may have gone wrong there. If the exact same set of 4 words gives very different infer_vector() results – as opposed to just slightly different results, as is normal with this stochastic algorithm – some problems might be:
I suggest:
Enable logging at the INFO level and re-run your training, watching the output logs carefully. Verify that training takes time, reports sensible values for the number of unique words in the model's vocabulary and the total words in your corpus, and doesn't show errors or warnings.
Check that len(doc2vec_model.dv) and len(doc2vec_model.wv) are sensible for the expected number of documents and known words.
If you're then still having problems, expand your question text to show the code and parameters you used to initialize and train the Doc2Vec model, some meaningful excerpts from the logging that convinced you things were otherwise working, and some details of the size of your corpus (like total word count, total doc count, and average words per document).
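The logging and size checks suggested above could be set up roughly as follows. This is only a sketch: check_model_sizes and its expected_docs parameter are hypothetical names invented for illustration, and the helper assumes a trained model that exposes the dv and wv attributes mentioned above.

```python
import logging

# Gensim reports training progress through the standard logging module,
# so INFO-level logging has to be enabled before the model is built and trained.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def check_model_sizes(model, expected_docs):
    """Hypothetical helper: compare model sizes with what the corpus should give."""
    n_docs = len(model.dv)   # number of trained document vectors
    n_words = len(model.wv)  # number of words kept in the vocabulary
    logging.info(
        "doc vectors: %d (expected %d), vocabulary: %d words",
        n_docs, expected_docs, n_words,
    )
    # True only if every document got a vector and the vocabulary is non-empty.
    return n_docs == expected_docs and n_words > 0
```

After training you might call check_model_sizes(doc2vec_model, expected_docs=...) with the number of articles in your corpus. A False result, or a vocabulary far smaller than you expect, would point at the corpus preparation or training step rather than at the similarity calculation.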
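Also note:
Since these dense vectors are unlikely to have exact 0.0 values in any of their dimensions, they should stay in their dense representation; there is no need to convert them with full2sparse.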
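For reference, cosine similarity can be computed directly on the dense vectors, for example with NumPy. This is a minimal sketch: dense_cosine is a hypothetical helper name, and returning 0.0 for an all-zero vector is a convention chosen here, not something gensim prescribes.

```python
import numpy as np


def dense_cosine(vec_a, vec_b):
    """Cosine similarity of two dense 1-D vectors."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_b = np.asarray(vec_b, dtype=float)
    norm_product = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_product == 0.0:
        return 0.0  # convention chosen here for all-zero input
    return float(np.dot(vec_a, vec_b) / norm_product)
```

With the vectors from the question, this would be called as dense_cosine(vec_1, vec_2) on the raw infer_vector() outputs. It only removes the unnecessary conversion step; if the underlying model is poorly trained, the similarity will still look wrong.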