java string-metric

Identify strings with the same meaning in Java


I have the following problem: I want to identify strings in Java that have a similar meaning. I tried calculating the similarity between strings with Stringmetrics. That works as expected, but I need something more convenient.

For example, take the following two single-word strings:

String s1 = "apple";
String s2 = "appel";

These two strings are very similar. When I use cosine similarity, I get the following result:

double score = cosine.compare(s1, s2); // 0.0

But when I use Damerau-Levenshtein similarity, I get the following result:

double score = damerauLevenshtein.compare(s1, s2); // 0.8

The next problem is that there are a lot of synonyms for words. With Stringmetrics these synonyms are not considered.

For example these 2 strings should be considered the same:

String s3 = "purchase 10 bottles of water";
String s4 = "buy 10 waterbottles";

I hope you guys can help me.


Solution

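  • Levenshtein distance (edit distance) works like the auto-correct on your phone. Taking your example, we have apple vs appel. The words are close to each other if you count the single-letter insertions, deletions and replacements needed to turn one into the other: all we need to do here is swap e and l. Plain Levenshtein counts that as two replacements (e with l and l with e), while Damerau-Levenshtein counts a swap of two adjacent letters as a single edit, which is consistent with your score of 0.8 (one edit across five letters). Other words like applr or appee are closer to the original word apple under plain Levenshtein, because all you need to do is replace a single letter.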

    Cosine similarity is completely different: it counts the words in each string, builds a vector from those counts and checks how similar the two vectors are. Here you have two completely different words, so it returns 0.

    What you want is a combination of those two techniques, plus a computer with knowledge of the language, plus a synonym dictionary that is taken into account before and after applying those similarity algorithms. Imagine taking a sentence and replacing every single word with a synonym (who remembers Joey and the thesaurus?): the resulting sentence could be completely different. On top of that, every word can have multiple synonyms, and some of those synonyms can only be used in a specific context. In general, your task is simply not possible as of now; maybe in the future.

    P.S. If your task were possible, I think translation software would be basically perfect, but I'm not really sure about that.
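
To make the difference between the metrics concrete, here is a minimal, self-contained sketch in plain Java. It is not the Stringmetrics implementation: the class name SimilaritySketch, its method names and the tiny SYNONYMS table are all made up for illustration, the "swap" variant is the restricted (optimal string alignment) form of Damerau-Levenshtein, and normalizing by the longer string's length is an assumption chosen because it matches the 0.8 from the question.

```java
import java.util.HashMap;
import java.util.Map;

class SimilaritySketch {

    // Hypothetical, hand-written synonym table: maps a word to one canonical form.
    // A real solution would need a proper thesaurus and context handling.
    static final Map<String, String> SYNONYMS = Map.of("purchase", "buy");

    /**
     * Edit distance counting insertions, deletions and replacements.
     * With allowSwap, a swap of two adjacent letters also counts as one edit
     * (optimal string alignment, a restricted Damerau-Levenshtein).
     */
    static int editDistance(String a, String b, boolean allowSwap) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                if (allowSwap && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** Assumed normalization: 1 - distance / length of the longer string. */
    static double editSimilarity(String a, String b, boolean allowSwap) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 1.0 : 1.0 - (double) editDistance(a, b, allowSwap) / longer;
    }

    /** Counts whitespace-separated words, optionally mapping them through SYNONYMS. */
    static Map<String, Integer> wordCounts(String text, boolean useSynonyms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (useSynonyms) word = SYNONYMS.getOrDefault(word, word);
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between the two word-count vectors. */
    static double cosine(String s, String t, boolean useSynonyms) {
        Map<String, Integer> a = wordCounts(s, useSynonyms);
        Map<String, Integer> b = wordCounts(t, useSynonyms);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        String s1 = "apple", s2 = "appel";
        System.out.println(editSimilarity(s1, s2, false)); // about 0.6 (two replacements)
        System.out.println(editSimilarity(s1, s2, true));  // about 0.8 (one adjacent swap)
        System.out.println(cosine(s1, s2, false));         // 0.0 (no word in common)

        String s3 = "purchase 10 bottles of water", s4 = "buy 10 waterbottles";
        System.out.println(cosine(s3, s4, false)); // about 0.26 (only "10" matches)
        System.out.println(cosine(s3, s4, true));  // about 0.52 ("purchase" now maps to "buy")
    }
}
```

Even with the synonym table, the two sentences from the question stay far from a score of 1, because "bottles of water" and "waterbottles" still share no word. That is the part a lookup table cannot fix on its own.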