I have a set of documents that all fit a pre-defined category, and I have successfully trained a Doc2Vec model on those documents.
The question is, if I have a novel document, how can I calculate how closely this new document lines up with my trained model?
My current solution:
novel_vector = model.infer_vector(novel_doc_words, steps=20)
similarity_scores = model.docvecs.most_similar([novel_vector])

average = 0
for score in similarity_scores:
    average += score[1]
overall_similarity = average / len(similarity_scores)
I was unable to find any convenience methods for this in the documentation.
There's no built-in method to check this sort of "lines up with" value, with respect to the whole model.
A more typical approach, matching existing capabilities, would be to train a model on a diversity of documents – not just those in a specific category. Then, after inferring a new document's vector, calculate its average distance to documents of just the category of interest.
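That "average distance to the category" step can be sketched as a mean cosine similarity. This is a minimal illustration, not gensim API: `novel_vector` and `category_vectors` stand in for the outputs of `model.infer_vector(...)` and a lookup of the category's document vectors (e.g. `model.docvecs[tag]` for each tag in a hypothetical `category_tags` list).

```python
import numpy as np

def average_category_similarity(novel_vector, category_vectors):
    """Mean cosine similarity between one inferred document vector
    and the vectors of every training document in the category."""
    novel = novel_vector / np.linalg.norm(novel_vector)
    cat = category_vectors / np.linalg.norm(category_vectors, axis=1, keepdims=True)
    return float(cat.dot(novel).mean())

# Hypothetical stand-ins; with a real Doc2Vec model these would come from
#   novel_vector = model.infer_vector(novel_doc_words)
#   category_vectors = np.vstack([model.docvecs[tag] for tag in category_tags])
novel_vector = np.array([1.0, 0.0, 1.0])
category_vectors = np.array([[1.0, 0.0, 1.0],
                             [0.0, 1.0, 0.0]])
print(average_category_similarity(novel_vector, category_vectors))  # → 0.5
```

A higher average means the new document sits closer to the category's region of the vector space; you'd pick an acceptance threshold empirically on held-out documents.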
If you instead train a model on only documents of a certain self-similar category, the learned coordinate space won't reflect the full range of possible documents outside that category as well.
That said, if your current code, which checks how similar a new document is to its top-N nearest neighbors, gives good results for your purposes, it may be acceptable. I'd just expect better results from a model trained on a wider variety of documents.