I have been trying to use the Simmetrics library, pulled in with this Maven dependency:
<dependency>
<groupId>com.github.mpkorstanje</groupId>
<artifactId>simmetrics-core</artifactId>
<version>4.1.0</version>
</dependency>
So far I am computing Jaro Winkler using:
StringMetric sm = StringMetrics.jaroWinkler();
float res = sm.compare("Harry Potter", "Potter Harry");
System.out.println(res);
0.43055558
and the overlap coefficient using:
sm = StringMetrics.overlapCoefficient();
res = sm.compare("The quick brown fox", "The slow brawn fur");
System.out.println(res);
0.25
but according to https://asecuritysite.com/forensics/simstring the Jaro-Winkler score should be 0 for this, and the overlap coefficient should be 100. Is this even the correct way to use this library? What are the proper calls if I want to run both of these metrics to match movies from one list against another list that I got from IMDB? I intend to compare the titles from both sets and take the average of the two scores, then do the same for the cast from both sets of movies. Thanks
You are using the library correctly. You may, however, wish to customize the metric you are using. It sounds like filtering out short, common words such as 'the', 'a', 'and', etc., and using a q-gram tokenizer might be more effective than using the default metrics from StringMetrics, most of which tokenize on whitespace and none of which apply filters or simplifiers.
Beyond that I can't really tell you which combination of metrics, tokenizers, filters and simplifiers will work for your use case. What works best is rather domain specific. You'll have to try a few combinations and see which one gives the best results.
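To make the suggestion above concrete, here is a minimal plain-Java sketch of what lower-casing, stop-word filtering and q-gram tokenization do before an overlap coefficient is computed. It deliberately does not use Simmetrics and is not how the library is implemented; the class name, the stop-word list and the choice of q = 3 are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java illustration only; this is not Simmetrics code.
class QGramOverlapSketch {

    // Hypothetical stop-word list; tune it for your own data.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Lower-case the input, split on whitespace and drop stop words.
    static String simplify(String s) {
        return Arrays.stream(s.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }

    // Collect the set of all substrings of length q (q-grams).
    static Set<String> qGrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        if (s.isEmpty()) {
            return grams;
        }
        if (s.length() < q) {
            grams.add(s); // too short to split: keep the whole string
            return grams;
        }
        for (int i = 0; i + q <= s.length(); i++) {
            grams.add(s.substring(i, i + q));
        }
        return grams;
    }

    // Overlap coefficient: |A ∩ B| / min(|A|, |B|).
    static double overlap(String a, String b, int q) {
        Set<String> ga = qGrams(simplify(a), q);
        Set<String> gb = qGrams(simplify(b), q);
        if (ga.isEmpty() || gb.isEmpty()) {
            return ga.isEmpty() && gb.isEmpty() ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return (double) common.size() / Math.min(ga.size(), gb.size());
    }

    public static void main(String[] args) {
        // Word order matters much less once the strings are cut into 3-grams.
        System.out.println(overlap("Harry Potter", "Potter Harry", 3)); // 0.7
        // The leading article is filtered out before comparison.
        System.out.println(overlap("The Matrix", "matrix", 3)); // 1.0
    }
}
```

With whitespace tokens, swapped or slightly misspelled words either match completely or not at all; q-grams give partial credit, which is usually what you want for titles and cast names.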
When I use the website you provided to calculate the Cosine Similarity and Overlap Coefficient of "The quick brown fox" and "The slow brawn fur", I get:
String 1: The quick brown fox
String 2: The slow brawn fur
The results are then:
Cosine Similarity 25
Overlap Coefficient 25
Simmetrics gives the same values, on a scale of 0 to 1 rather than 0 to 100:
System.out.println(
StringMetrics.overlapCoefficient().compare(
"The quick brown fox", "The slow brawn fur")); // 0.25
System.out.println(
StringMetrics.cosineSimilarity().compare(
"The quick brown fox", "The slow brawn fur")); // 0.25
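The value 0.25 is what you would expect from whitespace tokenization: each string has four tokens and only "The" is shared. The following plain-Java sketch reproduces the arithmetic by hand. It assumes simple set semantics and whitespace splitting, and is not the Simmetrics implementation; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hand-rolled illustration of the arithmetic; not Simmetrics code.
class WhitespaceTokenSketch {

    // Split on whitespace and keep the distinct tokens.
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.trim().split("\\s+")));
    }

    // Number of tokens that occur in both sets.
    static int shared(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Overlap coefficient: |A ∩ B| / min(|A|, |B|).
    static double overlapCoefficient(String x, String y) {
        Set<String> a = tokens(x);
        Set<String> b = tokens(y);
        return (double) shared(a, b) / Math.min(a.size(), b.size());
    }

    // Cosine similarity over token sets: |A ∩ B| / sqrt(|A| * |B|).
    static double cosineSimilarity(String x, String y) {
        Set<String> a = tokens(x);
        Set<String> b = tokens(y);
        return shared(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        String s1 = "The quick brown fox";
        String s2 = "The slow brawn fur";
        System.out.println(overlapCoefficient(s1, s2)); // 1 / min(4, 4) = 0.25
        System.out.println(cosineSimilarity(s1, s2));   // 1 / sqrt(4 * 4) = 0.25
    }
}
```

Because both strings have the same number of tokens, the two formulas have the same denominator here, which is why the two metrics agree.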
Regarding Jaro Winkler, it looks like the website is using an older version of Simmetrics. The specific combination of metrics and names, in particular Chapman Length Deviation, which was written by the original author of Simmetrics, Sam Chapman, leaves little doubt about it.
The older versions had some peculiarities, though I can't point to the specific one that is causing this difference without debugging them side by side again.