How can I find which documents in a list are semantically similar to a target paragraph or document?
import os
import sys  # needed for sys.exit() in read_file
import gensim
import smart_open
import random
from nltk.tokenize import word_tokenize

# Set file names for train and test data
test_data_dir = 'C:\\Users\\hamza\\Desktop\\'
train_file = os.path.join(test_data_dir, 'read-me.txt')
target_file = os.path.join(test_data_dir, 'read-me2.txt')

def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_data = list(read_corpus(train_file))
target_data = word_tokenize(read_file(target_file))
# print(target_data)
# print(test_corpus)

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_data)
# print(f"Word 'noise' appeared {model.wv.get_vecattr('noise', 'count')} times in the training corpus.")
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

inferred_vector = model.infer_vector(target_data)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
print(sims)
Output
[(1, 0.20419511198997498), (2, 0.1924923211336136), (0, 0.10696495324373245)]
Now, how can I match the target data against the training data, and how do I know how similar they are? Is there a way to scale the similarity into a percentage?
Despite the class name Doc2Vec, and the fact that it is based on an algorithm called 'Paragraph Vectors', this algorithm for modeling text has no inherent idea what 'paragraphs' or 'documents' are.
It simply takes whatever texts you give it – where each text is a list-of-words – & learns a way to plot those texts into a vector-space for comparisons.
So, using it to "match one target document or paragraph in other documents and bring them as a list how much they are semantically similar" is one possible application:
- Train a Doc2Vec model with your full set of texts.
- Use that model to infer (via the .infer_vector() method) vectors for new texts that use the same words as the model already knows.
- Look up the vector for a training text with model.dv[tag]. Get a vector for a new text with model.infer_vector(list_of_words). Compare those vectors using any vector operations you'd like.
- To rank the training documents by similarity, use model.dv.most_similar(): you can either supply a tag (to name one of the training documents) or a raw vector (via the positive argument) as the target point.

If this & the tutorial aren't enough to make progress, you should better explain more about what your data is, what you've tried so far, and where things haven't yet worked – with as much of your code, and precise info about what has and hasn't been achieved yet, as possible.
(It's nearly impossible to give a helpful answer to "guide me through this generic underspecified project". But if you say instead – "I have data D & want to achieve well-described goal G. I've tried X, but only had result or error Y so far, when my ideal result would be more like Z. What would help me get from my progress Y so far, to my desired result Z?" – then it is possible to give tangible tips/pointers/explanations.)