javascriptfuzzy-search

Javascript partial string fuzzy match?


Imagine I need to compare two strings to one reference text:

Reference Text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Search Text 1:

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. (Matching 1 sentence in the reference exactly)

Search Text 2:

I use the LLM (Lawyer, Liar, or Manager) model to determine how to respond to user input based on their tone and word choice. If the user's tone and word choice indicate that they are expressing a legal concern, I will refer them to a lawyer. If the user's tone and word choice indicate that they are lying, I will call them out on it and encourage them to be honest. If the user's tone and word choice indicate that they are expressing a managerial concern, I will offer them guidance and support. (Completely different from reference)

I need a way or a library that can confidently tell me that search text 1 is way more similar to reference text than search text 2, since there is actually one sentence that matches it exactly.

Unfortunately, it appears that most of the popular javascript string similarity libraries fails to identify similarities when two comparing strings have very different length. They will all incorrectly conclude that both search texts 1 and 2 are equally similar to the reference:

var stringSimilarity = require("string-similarity");
var levenshtein = require('fast-levenshtein');
var similarity = require('similarity')

stringSimilarity.compareTwoStrings(ref, text1); //0.3
stringSimilarity.compareTwoStrings(ref, text2); //0.359
levenshtein.get(ref, text1); //338
levenshtein.get(ref, text2); //379
similarity(ref, text1); //0.24
similarity(ref, text2); //0.24

Is there a better library that can achieve partial string fuzzy match like above?


Solution

  • Without knowing all the requirements and edge cases you may want to handle, it is impossible to suggest a perfect solution. But if you are fine with a straight brute force comparing the same words at similar positions, here is an option:

    const text1 = `Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.`;
    
    const text2 = `Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.`;
    
    const text3 = `I use the LLM (Lawyer, Liar, or Manager) model to determine how to respond to user input based on their tone and word choice. If the user's tone and word choice indicate that they are expressing a legal concern, I will refer them to a lawyer. If the user's tone and word choice indicate that they are lying, I will call them out on it and encourage them to be honest. If the user's tone and word choice indicate that they are expressing a managerial concern, I will offer them guidance and support.`;
    
    const text4 = `Ut bla bla enim garbage ad minim bla veniam, quis bla bla nostrud exercitation more garbage ullamco labori bla nisi ut aliquip ex bla ea commodo bla consequat.`;
    
    
    const compare = (a, b) => {
      const ax = a.replace(/[^A-Za-z0-9]/g, ' ')
        .split(' ')
        .map(s => s.toLowerCase())
        .filter(s => s);
      const bx = b.replace(/[^A-Za-z0-9]/g, ' ')
        .split(' ')
        .map(s => s.toLowerCase())
        .filter(s => s);
        
      let similar = 0;
      for (let ia = 0; ia < ax.length; ia ++) {
        for (let ib = 0; ib < bx.length; ib ++) {
          if (ax[ia] === bx[ib]) {
            ia ++;
            similar ++;
          }
        }
      }
      return similar
        ? (similar / ax.length + similar / bx.length) / 2
        : 0;
    };
    
    console.log(compare(text1, text2));
    console.log(compare(text1, text3));
    console.log(compare(text2, text3));
    console.log(compare(text2, text4));
    console.log(compare(text2, text2));