nlp, gensim, cosine-similarity, doc2vec

How to get most similar words to a tagged document in gensim doc2vec


I have trained a doc2vec model.

from gensim.models.doc2vec import Doc2Vec

doc2vec = Doc2Vec(vector_size=300,
                  window=10,
                  min_count=100,
                  dm=1,
                  epochs=40)
doc2vec.build_vocab(corpus_file=train_data, progress_per=1000)
doc2vec.train(....)

The documents are tagged with incremental integers 0, 1, ..., 1000.

To get the top-n similar words to a document with tag=0, I used:

doc_vector = doc2vec.dv[tag]
sims = doc2vec.wv.similar_by_vector(doc_vector, topn=20)

The similar words make sense; however, the similarity scores look really "weird": all of them are almost 1.0. I checked topn=3000 and the scores are still around 1.0. Does it make sense to get all words with such a high similarity score?


Solution

  • In traditional uses of this algorithm with varied natural-language texts, no, it's not typical for the most-similar 'nearest neighbors' of texts to all have near-1.0 similarities.

    That suggests something may be amiss with your setup – unless, of course, your data does include lots of 'texts' that are nearly identical.

    Are you perhaps using some atypical corpus, maybe not natural language, where so many super-close similarity scores are still accurate & useful? That's the ultimate test.

    That is: if, for a bunch of doc probes, the "similar words" vary and are individually sensible/useful with regard to the origin documents, I'd not worry too much about the absolute magnitudes of the similarity scores.

    Such scores have more meaning in comparison to each other than on any absolute scale. A similarity of 0.9 is not meaningfully interpretable as "X% similar" or even "among the top X% of most-similar candidates". It only means "more similar than items with 0.8 similarity, and less similar than items with 0.95 similarity".

    Some things to look at, if you want a better sense of whether things may be going wrong (a rough way to run these checks in code is sketched after this list):

    Are there at least some words with far-lower similarity scores, and do those make sense? Do doc-to-doc comparisons seem roughly sensible?

    What is the rough character & size of your data, and how many docs & unique words are in your corpus? How many surviving words (after applying the min_count=100 cutoff) remain in the trained model?

    If you run with logging enabled, do the various step & progress reports suggest that a model of the expected size, with the expected amount of training effort, is being created?
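
    For reference, here is a minimal diagnostic sketch of those checks. It assumes the trained doc2vec model from the question is in scope and uses gensim 4.x names (wv, dv, similar_by_vector with topn, most_similar); adjust the probe tag & counts to your own data:

        import logging

        # Verbose INFO logging: re-running build_vocab()/train() after this
        # will show gensim's vocabulary-survival & epoch-progress reports.
        logging.basicConfig(
            format="%(asctime)s : %(levelname)s : %(message)s",
            level=logging.INFO,
        )

        # Rough model size: words surviving the min_count=100 cutoff,
        # and the number of document tags that were trained.
        print("surviving words:", len(doc2vec.wv))
        print("document tags:", len(doc2vec.dv))

        # Spread of word similarities for one document: are the *least*
        # similar words far below 1.0, and do both extremes look sensible?
        doc_vector = doc2vec.dv[0]
        all_sims = doc2vec.wv.similar_by_vector(doc_vector, topn=len(doc2vec.wv))
        print("closest words:", all_sims[:10])
        print("farthest words:", all_sims[-10:])

        # Doc-to-doc comparison: do the nearest-neighbor documents of doc 0
        # make sense, and do their scores span a reasonable range?
        print("docs most like doc 0:", doc2vec.dv.most_similar(positive=[0], topn=10))

    If even the farthest words & docs here still score near 1.0, that strengthens the "something may be amiss" suspicion above.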