search nlp full-text-search similarity sentence-similarity

How to detect if two sentences are similar, not in meaning, but in syllables/words?

Here are some examples of the types of sentences that need to be considered "similar":

there was a most extraordinary noise going on shrinking rapidly she soon made out
there was a most extraordinary noise going on shrinking rapid

that will be a very little alice knew it was just possible it had
thou wilt be very little alice i knew it was possible to add

however at last it sat down and looked very anxiously into her face and
however that lives in sadtown and look very anxiously into him facing it

she went in search of her or of anything to say she simply bowed
she went in the search of her own or of anything to say

and she squeezed herself up on tiptoe and peeped over the wig he did
and she squeezed herself up on the tiptoe and peeped over her wig he did

she had not noticed before and behind it was very glad to find that
she had not noticed before and behind it it was very glad to find that

as soon as the soldiers had to fall a long hookah and taking not
soon as the soldiers have to fall along huka and taking knots

And here are some examples of more difficult edge cases that I would like to be able to catch, but that are not as necessary:

so she tucked it under her arm with its head it would not join
she tucked it under her arm with its head

let me see four times five is twelve and four times five is twelve 
let me see  times  is  and  times  is

let me see four times seven is oh dear run home this moment and 
times  is o dear run home this moment and

in a minute or two she walked sadly down the middle being held up 
and then well see you sidely down the middle in health often

Sentences that are somewhat different and have no such similarities need to be marked as dissimilar. If an algorithm exists that outputs a "score" rather than just a boolean similar-or-not, I could determine the necessary threshold through my own testing.

The top sentence in each example is randomly generated; the bottom sentence is the output of a speech-to-text neural network, from an audio file of someone reading out the top line. If there is some syllabic comparison method that would be much more accurate given that I have the initial source text as well as the audio, I could also employ that instead of this word comparison technique.

My current method involves indexing each word, once forwards and once in reverse, and then checking how many words line up at the same position. If at least 10 words match in either indexing order, I count the sentences as similar. However, all of the examples presented above are cases where this strategy does not work.
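For reference, here is a minimal sketch of that position-based check. It is only one reading of the description above, not the exact code in use: the function names `positionalMatches` and `isSimilarByPosition` are hypothetical, and splitting on whitespace is an assumption.

```javascript
// Hypothetical reconstruction of the position-based check described above.
// Counts the words that are identical at the same index in both arrays.
function positionalMatches(wordsA, wordsB) {
    const n = Math.min(wordsA.length, wordsB.length);
    let count = 0;
    for (let i = 0; i < n; i++) {
        if (wordsA[i] === wordsB[i]) count++;
    }
    return count;
}

// Compares once from the start and once from the end of the sentences.
// minMatches = 10 is the threshold mentioned in the question.
function isSimilarByPosition(sentenceA, sentenceB, minMatches = 10) {
    const a = sentenceA.trim().split(/\s+/); // assumption: whitespace tokenizing
    const b = sentenceB.trim().split(/\s+/);
    const forward = positionalMatches(a, b);
    const backward = positionalMatches([...a].reverse(), [...b].reverse());
    return { forward, backward, similar: Math.max(forward, backward) >= minMatches };
}

const source = "there was a most extraordinary noise going on shrinking rapidly she soon made out";
const transcript = "there was a most extraordinary noise going on shrinking rapid";
const positional = isSimilarByPosition(source, transcript);
console.log(positional);
```

With the first example pair, the forward count stops just short of the threshold and the backward count finds nothing, because the shorter transcript shifts every position when read from the end. A single dropped or inserted word has the same effect on every position after it.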

Solution

This is exactly the same approach as my answer above, but in Node.js. Apart from the language difference, the code works the same way.

First you need to install the natural module using npm.

    npm install natural


    const natural = require('natural');
    
    function dotProduct(vector1, vector2) {
        return vector1.reduce((acc, val, index) => acc + val * vector2[index], 0);
    }
    
    function magnitude(vector) {
        return Math.sqrt(vector.reduce((acc, val) => acc + val * val, 0));
    }
    
    function cosineSimilarity(vector1, vector2) {
        const dotProd = dotProduct(vector1, vector2);
        const mag1 = magnitude(vector1);
        const mag2 = magnitude(vector2);
    
        if (mag1 === 0 || mag2 === 0) {
            return 0; // Avoid division by zero
        }
    
        return dotProd / (mag1 * mag2);
    }
    
    function sentenceSimilarity(sentence1, sentence2) {
        // Tokenizing sentences
        const tokenizer = new natural.WordTokenizer();
        const sentence1Tokens = tokenizer.tokenize(sentence1);
        const sentence2Tokens = tokenizer.tokenize(sentence2);
    
        // Creating a bag of words from tokens
        const bagOfWords = new Set([...sentence1Tokens, ...sentence2Tokens]);
    
        // Convert tokens to vectors
        const vector1 = Array.from(bagOfWords).map(word => sentence1Tokens.includes(word) ? 1 : 0);
        const vector2 = Array.from(bagOfWords).map(word => sentence2Tokens.includes(word) ? 1 : 0);
    
        // Calculate cosine similarity
        const similarity = cosineSimilarity(vector1, vector2);
    
        return similarity;
    }
    
    // Example usage
    const sentence1 = "This is a sentence.";
    const sentence2 = "This is another sentence.";
    const similarityScore = sentenceSimilarity(sentence1, sentence2);
    console.log("Similarity score:", similarityScore);

The functions dotProduct(), magnitude(), and cosineSimilarity() had to be defined by hand because, unlike in Python, I was not able to find a Node library that provides them. Apart from that, the rest of the logic is the same as in the Python code above.
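As a rough sketch of how the score could be applied to the sentence pairs from the question, the snippet below computes the same binary bag-of-words cosine without any dependency. Because the vectors only contain 0s and 1s, the cosine reduces to the number of shared distinct words divided by the square root of the product of the two distinct-word counts. The names `tokenSet`, `binaryCosine`, and `THRESHOLD` are hypothetical, the regex tokenizer is only a stand-in for `natural.WordTokenizer`, and the threshold value is a placeholder to be tuned by testing.

```javascript
// Assumption: a plain regex tokenizer stands in for natural.WordTokenizer.
function tokenSet(sentence) {
    return new Set(sentence.match(/\w+/g) || []);
}

// Cosine similarity of two binary bag-of-words vectors:
// shared distinct words / sqrt(|A| * |B|).
function binaryCosine(sentenceA, sentenceB) {
    const a = tokenSet(sentenceA);
    const b = tokenSet(sentenceB);
    if (a.size === 0 || b.size === 0) {
        return 0; // Avoid division by zero
    }
    let shared = 0;
    for (const word of a) {
        if (b.has(word)) shared++;
    }
    return shared / Math.sqrt(a.size * b.size);
}

// Placeholder threshold, an assumption; tune it against real transcripts.
const THRESHOLD = 0.5;

const pairs = [
    // A pair from the question that should count as similar.
    ["she had not noticed before and behind it was very glad to find that",
     "she had not noticed before and behind it it was very glad to find that"],
    // Two unrelated lines from the question, for contrast.
    ["there was a most extraordinary noise going on shrinking rapidly she soon made out",
     "so she tucked it under her arm with its head it would not join"],
];

for (const [top, bottom] of pairs) {
    const score = binaryCosine(top, bottom);
    console.log(score.toFixed(3), score >= THRESHOLD ? "similar" : "dissimilar");
}
```

Since a bag of words ignores word order and repeated words, an inserted or dropped word only lowers the score slightly instead of breaking the alignment. The trade-off is that two sentences built from the same words in a different order would also score high.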