lucene jaro-winkler

JaroWinklerDistance in Lucene is returning strange results


I have a file containing some phrases. Using Lucene's JaroWinklerDistance, I want to retrieve the phrases in that file that are most similar to my input.

Here is an example of my problem.

We have a file containing:

// words.txt
this is goodd
this is good
this is god

If my input is "this is good", I expect "this is good" to come back from the file first, since it has the highest possible similarity score (1). But for some reason, only "this is goodd" and "this is god" are returned!

Here is my code:

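// Imports assume the Lucene 5.x-8.x package layout (lucene-suggest, lucene-analyzers-common)
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.Dictionary;
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.RAMDirectory;

try {
    // Build an in-memory spell index that scores candidates with Jaro-Winkler similarity
    SpellChecker spellChecker = new SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
    // Each line of the file becomes one dictionary entry
    Dictionary dictionary = new PlainTextDictionary(new File("src/main/resources/words.txt").toPath());
    IndexWriterConfig iwc = new IndexWriterConfig(new ShingleAnalyzerWrapper());
    spellChecker.indexDictionary(dictionary, iwc, false);

    String wordForSuggestions = "this is good";
    int suggestionsNumber = 5;

    // Ask for up to 5 suggestions with a similarity score of at least 0.8
    String[] suggestions = spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber, 0.8f);
    if (suggestions != null && suggestions.length > 0) {
        for (String word : suggestions) {
            System.out.println("Did you mean:" + word);
        }
    } else {
        System.out.println("No suggestions found for word:" + wordForSuggestions);
    }
} catch (IOException e) {
    e.printStackTrace();
}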

Solution

  • suggestSimilar never returns a suggestion that is identical to the input. To quote the SpellChecker source code:

    // don't suggest a word for itself, that would be silly

    To find out whether wordForSuggestions itself is in the indexed dictionary, call the exist method:

    if (spellChecker.exist(wordForSuggestions)) {
        // handle a word that is, apparently, spelled correctly
    }
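
If the goal is a single result list with the exact match on top, the two calls can be combined. The following is only a minimal sketch: the class ExactMatchFirst and the method withExactMatchFirst are hypothetical names, not part of Lucene, and main uses stand-in values in place of a real index so that it runs without any Lucene dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExactMatchFirst {

    /**
     * Hypothetical helper (not a Lucene API): puts the input itself at the head
     * of the list when the index contains it, followed by the suggestions.
     *
     * @param input         the phrase that was looked up
     * @param existsInIndex in real code, the result of spellChecker.exist(input)
     * @param suggestions   in real code, the result of spellChecker.suggestSimilar(...)
     */
    public static List<String> withExactMatchFirst(String input, boolean existsInIndex, String[] suggestions) {
        List<String> results = new ArrayList<>();
        if (existsInIndex) {
            // suggestSimilar skips the input itself, so add it back explicitly
            results.add(input);
        }
        if (suggestions != null) {
            results.addAll(Arrays.asList(suggestions));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in values mirroring the behaviour described in the question
        String[] suggestions = {"this is goodd", "this is god"};
        System.out.println(withExactMatchFirst("this is good", true, suggestions));
        // prints [this is good, this is goodd, this is god]
    }
}
```

With the spellChecker from the question, the call would presumably look like withExactMatchFirst(wordForSuggestions, spellChecker.exist(wordForSuggestions), spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber, 0.8f)). Both Lucene calls can throw IOException, so it belongs inside the existing try block.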