doc2vec

Why is the doc2vec result wrong for the same tokenized word list?


I'm using a Doc2Vec model. I pre-trained the model on a dataset containing more than 20K Wikipedia articles. After that, I tried to test the result by calculating the similarity between two sentences.

I have two sentences: 1. The process of searching for a job can be very stressful. 2. The job search process can be very stressful.

After preprocessing and tokenizing, the word list for sentence 1 is list_1 = ['process', 'search', 'job', 'stress'] and for sentence 2 is list_2 = ['job', 'search', 'process', 'stress']. I then compute vec_1 = doc2vec_model.infer_vector(list_1) and vec_2 = doc2vec_model.infer_vector(list_2), and use gensim.matutils.full2sparse and gensim.matutils.cossim to calculate the cosine similarity.

I got a result near 0, like 0.00709335870. That doesn't seem right; I think the result should be near 1.

What is the problem, and how do I fix it?

This is part of my code:

    # model.tokenize_word(data['document_1']) is ['process', 'search', 'job', 'stress']
    vec_1 = doc2vec_model.infer_vector(model.tokenize_word(data['document_1']))
    doc2vec_model.random.seed(0)

    # model.tokenize_word(data['document_2']) is ['job', 'search', 'process', 'stress']
    vec_2 = doc2vec_model.infer_vector(model.tokenize_word(data['document_2']))
    vec_1 = gensim.matutils.full2sparse(vec_1)
    vec_2 = gensim.matutils.full2sparse(vec_2)

    similarity = gensim.matutils.cossim(vec_1, vec_2)
    print(similarity)  # 0.00709335870
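
As a cross-check on the similarity step, the cosine can also be computed directly on the dense vectors, which would rule out the full2sparse conversion as a source of the low score. A minimal sketch in plain Python: the function name cosine_similarity and the sample vectors are illustrative and not part of the original code. In practice the inputs would be the two arrays returned by infer_vector().

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(float(a) ** 2 for a in vec_a))
    norm_b = math.sqrt(sum(float(b) ** 2 for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has no direction.
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative values only, not real Doc2Vec output.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

If this value and the gensim.matutils.cossim value agree, the conversion is not the issue and the inferred vectors themselves are what differ.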

Solution

  • You haven't shown how you ran your Doc2Vec training; something may have gone wrong there. If the exact same set of 4 words gives very different infer_vector() results (as opposed to slightly different results, which is normal with this stochastic algorithm), some problems might be:

    I suggest:

    If you're still having problems after that, expand your question to show the code and parameters you used to initialize and train the Doc2Vec model, some meaningful excerpts from the logging that convinced you things were otherwise working, and some details of the size of your corpus (such as total word count, total doc count, and average words per document).
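
The corpus-size details mentioned above can be gathered with a few lines of plain Python. A minimal sketch, assuming the corpus is available as an iterable of token lists: the function name corpus_stats and the toy corpus are hypothetical, and logging.basicConfig is the standard-library way to make INFO-level messages visible while training runs.

```python
import logging

# Make INFO-level messages visible; training progress is typically logged at this level.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def corpus_stats(tokenized_docs):
    """Return doc count, word count, and average words per doc for an iterable of token lists."""
    doc_count = 0
    word_count = 0
    for tokens in tokenized_docs:
        doc_count += 1
        word_count += len(tokens)
    avg = word_count / doc_count if doc_count else 0.0
    return {
        "total_docs": doc_count,
        "total_words": word_count,
        "avg_words_per_doc": avg,
    }


# Toy stand-in for the real training corpus (hypothetical data).
toy_corpus = [
    ["process", "search", "job", "stress"],
    ["job", "search", "process", "stress"],
]
stats = corpus_stats(toy_corpus)
logging.info("corpus stats: %s", stats)
```

Run on the real training corpus, these three numbers are the ones worth pasting into the question.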

    Also note: