
How to detect if two sentences are similar, not in meaning, but in syllables/words?


Here are some examples of the types of sentences that need to be considered "similar":

there was a most extraordinary noise going on shrinking rapidly she soon made out
there was a most extraordinary noise going on shrinking rapid
that will be a very little alice knew it was just possible it had
thou wilt be very little alice i knew it was possible to add
however at last it sat down and looked very anxiously into her face and
however that lives in sadtown and look very anxiously into him facing it
she went in search of her or of anything to say she simply bowed
she went in the search of her own or of anything to say
and she squeezed herself up on tiptoe and peeped over the wig he did
and she squeezed herself up on the tiptoe and peeped over her wig he did
she had not noticed before and behind it was very glad to find that
she had not noticed before and behind it it was very glad to find that
as soon as the soldiers had to fall a long hookah and taking not
soon as the soldiers have to fall along huka and taking knots

And here are some examples of more difficult edge cases that I would like to be able to catch, but that are not as necessary:

so she tucked it under her arm with its head it would not join
she tucked it under her arm with its head
let me see four times five is twelve and four times five is twelve 
let me see  times  is  and  times  is
let me see four times seven is oh dear run home this moment and 
times  is o dear run home this moment and
in a minute or two she walked sadly down the middle being held up 
and then well see you sidely down the middle in health often

Sentences that are somewhat different and have no such similarities need to be marked as dissimilar. If an algorithm exists that outputs a "score" rather than just a boolean similar-or-not, I could determine a suitable threshold through my own testing.

The top sentence in each example is randomly generated; the bottom sentence is the output of a speech-to-text neural network, from an audio file of someone reading out the top line. If there is some syllabic comparison method that would be much more accurate given that I have the initial source text as well as the audio, I could also employ that instead of this word comparison technique.

My current method involves indexing each word, once forwards and once in reverse, and then checking how many words line up. If at least 10 words match in either indexing order, I count the sentences as similar. However, all of the presented examples are cases where this strategy does not work.
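For reference, the positional method described above might look roughly like the following in Node.js. This is only a sketch under assumptions: it tokenizes on whitespace, and the names `countAligned` and `positionalMatch` are hypothetical, not taken from the actual code.

```javascript
// Count the positions at which both word lists hold the same word.
function countAligned(wordsA, wordsB) {
    const n = Math.min(wordsA.length, wordsB.length);
    let matches = 0;
    for (let i = 0; i < n; i++) {
        if (wordsA[i] === wordsB[i]) matches++;
    }
    return matches;
}

// Compare word by word from the start and from the end.
// Assumption: whitespace tokenization; minMatches = 10 as described above.
function positionalMatch(sentenceA, sentenceB, minMatches = 10) {
    const a = sentenceA.trim().split(/\s+/);
    const b = sentenceB.trim().split(/\s+/);
    const forward = countAligned(a, b);
    const backward = countAligned([...a].reverse(), [...b].reverse());
    return { forward, backward, similar: Math.max(forward, backward) >= minMatches };
}

// One of the pairs above: a single inserted word ("it it") shifts every later position.
console.log(positionalMatch(
    "she had not noticed before and behind it was very glad to find that",
    "she had not noticed before and behind it it was very glad to find that"
));
```

A single inserted or dropped word in the middle of the sentence breaks the alignment in both directions, which is why pairs like this one fall under the 10-word cutoff even though they are nearly identical.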


Solution

  • This is the same approach as my Python answer above, ported to Node.js. Apart from the language, the code works the same way.

    First you need to install the natural module using npm.

        npm install natural
    
    
        const natural = require('natural');
        
        function dotProduct(vector1, vector2) {
            return vector1.reduce((acc, val, index) => acc + val * vector2[index], 0);
        }
        
        function magnitude(vector) {
            return Math.sqrt(vector.reduce((acc, val) => acc + val * val, 0));
        }
        
        function cosineSimilarity(vector1, vector2) {
            const dotProd = dotProduct(vector1, vector2);
            const mag1 = magnitude(vector1);
            const mag2 = magnitude(vector2);
        
            if (mag1 === 0 || mag2 === 0) {
                return 0; // Avoid division by zero
            }
        
            return dotProd / (mag1 * mag2);
        }
        
        function sentenceSimilarity(sentence1, sentence2) {
            // Tokenizing sentences
            const tokenizer = new natural.WordTokenizer();
            const sentence1Tokens = tokenizer.tokenize(sentence1);
            const sentence2Tokens = tokenizer.tokenize(sentence2);
        
            // Creating a bag of words from the tokens of both sentences
            const bagOfWords = new Set([...sentence1Tokens, ...sentence2Tokens]);
        
            // Convert tokens to vectors
            const vector1 = Array.from(bagOfWords).map(word => sentence1Tokens.includes(word) ? 1 : 0);
            const vector2 = Array.from(bagOfWords).map(word => sentence2Tokens.includes(word) ? 1 : 0);
        
            // Calculate cosine similarity
            const similarity = cosineSimilarity(vector1, vector2);
        
            return similarity;
        }
        
        // Example usage
        const sentence1 = "This is a sentence.";
        const sentence2 = "This is another sentence.";
        const similarityScore = sentenceSimilarity(sentence1, sentence2);
        console.log("Similarity score:", similarityScore);
    
    

    The functions dotProduct(), magnitude(), and cosineSimilarity() had to be defined by hand because, unlike in Python, I was not able to find a Node library that provides them. Apart from that, the logic is the same as in the Python code above.
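To try the score on sentence pairs from the question without installing anything, the same binary bag-of-words cosine can be sketched in plain JavaScript. The assumptions here are mine: `tokenize` is a hypothetical stand-in for natural's `WordTokenizer` (lowercasing and splitting on non-alphanumeric characters), and `bagOfWordsCosine` is a hypothetical name. For 0/1 vectors the dot product is just the number of shared words, and each magnitude is the square root of the number of distinct words.

```javascript
// Assumption: a simple regex tokenizer standing in for natural.WordTokenizer.
function tokenize(sentence) {
    return sentence.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);
}

// Cosine similarity of binary bag-of-words vectors, computed with sets.
function bagOfWordsCosine(sentenceA, sentenceB) {
    const a = new Set(tokenize(sentenceA));
    const b = new Set(tokenize(sentenceB));
    if (a.size === 0 || b.size === 0) {
        return 0; // Avoid division by zero
    }
    let shared = 0;
    for (const word of a) {
        if (b.has(word)) shared++;
    }
    return shared / Math.sqrt(a.size * b.size);
}

// A pair that should count as similar, and one that should not.
const source = "she had not noticed before and behind it was very glad to find that";
const transcript = "she had not noticed before and behind it it was very glad to find that";
const unrelated = "there was a most extraordinary noise going on shrinking rapidly she soon made out";

console.log("similar pair:", bagOfWordsCosine(source, transcript));
console.log("unrelated pair:", bagOfWordsCosine(source, unrelated));
```

Because this score ignores word order and repeated words, an inserted or dropped word barely changes it. The cutoff between "similar" and "dissimilar" still has to be chosen by testing on real transcripts.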