python nlp doc2vec open-semantic-search

Finding similarity of 1 paragraph in different documents with Doc2vec


How can I take one target paragraph or document and find, from a list of other documents, which ones are semantically similar to it?

import os
import sys
import gensim
import smart_open
from nltk.tokenize import word_tokenize

# Set file names for train and target data
test_data_dir = 'C:\\Users\\hamza\\Desktop\\'
train_file = os.path.join(test_data_dir, 'read-me.txt')
target_file = os.path.join(test_data_dir, 'read-me2.txt')

def read_file(filename):
    try:
        with open(filename, 'r') as f:
            return f.read()
    except IOError:
        print("Error opening or reading input file:", filename)
        sys.exit()

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_data = list(read_corpus(train_file))
# Tokenize the target the same way as the training data, so the inferred
# vector is comparable (simple_preprocess also lowercases the tokens)
target_data = gensim.utils.simple_preprocess(read_file(target_file))

# print(target_data)
# print(train_data)
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_data)
# print(f"Word 'noise' appeared {model.wv.get_vecattr('noise', 'count')} times in the training corpus.")
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)
inferred_vector = model.infer_vector(target_data)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
print(sims)

Output

[(1, 0.20419511198997498), (2, 0.1924923211336136), (0, 0.10696495324373245)]

Now, how can I match the target data to the train data, and how will I know how similar they are? Is there any way to scale the similarity into a percentage?


Solution

  • Despite the class name Doc2Vec, and the fact that it is based on an algorithm called 'Paragraph Vectors', this algorithm for modeling text has no inherent idea what 'paragraphs' or 'documents' are.

    It simply takes whatever texts you give it – where each text is a list-of-words – & learns a way to plot those texts into a vector-space for comparisons.

    So, using it to match one target document or paragraph against other documents, and list how semantically similar each one is, is one possible application.

    If this & the tutorial aren't enough to make progress, you should explain more about what your data is, what you've tried so far, and where things haven't yet worked – with as much of your code, and as precise info as possible about what has and hasn't been achieved.

    (It's nearly impossible to give a helpful answer to "guide me through this generic underspecified project". But if you say instead – "I have data D & want to achieve well-described goal G. I've tried X, but only had result or error Y so far, when my ideal result would be more like Z. What would help me get from my progress Y so far, to my desired result Z?" – then it is possible to give tangible tips/pointers/explanations.)