I have the following working Java code for searching for a word in a list of words, and it works as expected:
import java.util.HashSet;
import java.util.Set;

public class Levenshtein {
private int[][] wordMartix;
public Set<String> similarExists(String searchWord) {
int maxDistance = searchWord.length();
int curDistance;
int sumCurMax;
String checkWord;
// a Set prevents duplicate words in the returned list
Set<String> fuzzyWordList = new HashSet<>();
for (Object wordList : Searcher.wordList) {
checkWord = String.valueOf(wordList);
curDistance = calculateDistance(searchWord, checkWord);
sumCurMax = maxDistance + curDistance;
if (sumCurMax == checkWord.length()) {
fuzzyWordList.add(checkWord);
}
}
return fuzzyWordList;
}
public int calculateDistance(String inputWord, String checkWord) {
wordMartix = new int[inputWord.length() + 1][checkWord.length() + 1];
for (int i = 0; i <= inputWord.length(); i++) {
wordMartix[i][0] = i;
}
for (int j = 0; j <= checkWord.length(); j++) {
wordMartix[0][j] = j;
}
for (int i = 1; i < wordMartix.length; i++) {
for (int j = 1; j < wordMartix[i].length; j++) {
if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
wordMartix[i][j] = wordMartix[i - 1][j - 1];
} else {
int minimum = Integer.MAX_VALUE;
if ((wordMartix[i - 1][j]) + 1 < minimum) {
minimum = (wordMartix[i - 1][j]) + 1;
}
if ((wordMartix[i][j - 1]) + 1 < minimum) {
minimum = (wordMartix[i][j - 1]) + 1;
}
if ((wordMartix[i - 1][j - 1]) + 1 < minimum) {
minimum = (wordMartix[i - 1][j - 1]) + 1;
}
wordMartix[i][j] = minimum;
}
}
}
return wordMartix[inputWord.length()][checkWord.length()];
}
}
Right now, when I search for a word like job, it returns a list:
Output
joborienterede
jobannoncer
jobfunktioner
perjacobsen
jakobsen
jobprofiler
jacob
jobtitler
jobbet
jobdatabaserne
jobfunktion
jakob
jobs
studenterjobber
johannesburg
jobmuligheder
jobannoncerne
jobbaser
job
joberfaringer
As you can see, the output contains a lot of related words but also unrelated ones like jakob, jacob, etc. That is correct according to the Levenshtein formula, but I would like to build on it and write a method that fine-tunes my search so I get more relevant and related words.
I have worked on it for a few hours and have run out of ideas.
My question: Is it possible to fine-tune the existing method to return relevant/related words, or should I take another approach? Either way, I would appreciate input and inspiration on improving the search results.
UPDATE
After asking this question a long time ago, I still have not found a solution, and I am back to it because I now need a useful answer. It is fine to support the answer with Java code samples, but what matters most is a detailed answer describing the available methods and approaches for indexing the best and most relevant search results and ignoring inappropriate words. I know this is an open and endless area, but I need inspiration to start somewhere.
Note: The oldest answer right now is based on one of the comments and is not helpful; it just sorts by distance, which does not improve the quality of the search results.
So I did distance sorting and the results were like this:
job
jobs
jacob
jakob
jobbet
jakobsen
jobbaser
jobtitler
jobannoncer
jobfunktion
jobprofiler
perjacobsen
johannesburg
jobannoncerne
joberfaringer
jobfunktioner
jobmuligheder
jobdatabaserne
joborienterede
studenterjobber
So the word jobbaser is relevant and jacob/jakob are not, but the distance for jobbaser is larger than for jacob/jakob. So that did not really help.
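A note on why jakob and jacob pass the filter: the condition in similarExists, searchWord.length() + distance == checkWord.length(), holds exactly when checkWord can be built from searchWord using insertions only, that is, when the letters of searchWord appear in checkWord in order (a subsequence), not necessarily next to each other. The sketch below illustrates this with a hypothetical helper isSubsequence and contrasts it with a plain String.contains check. The class and method names are assumptions made for illustration, and a substring test is only one possible tightening, not a full relevance measure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubsequenceVsSubstring {

    // True when the characters of needle occur in haystack in order,
    // possibly with other characters in between (hypothetical helper).
    static boolean isSubsequence(String needle, String haystack) {
        int matched = 0;
        for (int i = 0; i < haystack.length() && matched < needle.length(); i++) {
            if (haystack.charAt(i) == needle.charAt(matched)) {
                matched++;
            }
        }
        return matched == needle.length();
    }

    // Keeps only the words that contain searchWord as a contiguous block.
    static List<String> containing(String searchWord, List<String> words) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (word.contains(searchWord)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("jakob", "jacob", "jobbaser", "johannesburg", "jobs");
        for (String w : words) {
            System.out.println(w + " subsequence=" + isSubsequence("job", w)
                    + " substring=" + w.contains("job"));
        }
        System.out.println(containing("job", words)); // prints [jobbaser, jobs]
    }
}
```

All five sample words contain j, o, b in order, so all of them pass the current filter, but only jobbaser and jobs contain job as a contiguous substring.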
General feedback regarding answers
I would like to personally thank all of you who contributed to this question; I have received nice answers and useful comments.
Special thanks to answers from @SergioMontoro, @uSeemSurprised, and @Gene, those are different but valid and useful answers.
@D.Kovács is pointing out some interesting solutions.
I wish I could give a bounty to all of those answers. I have to choose one answer for the bounty; that does not mean the other answers are not valid, only that the particular answer I chose was useful for me.
Short of understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is easy to modify Levenshtein to compare syllables instead of characters; the hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you'll probably want to avoid syllable comparison of two single-syllable words and handle that case with standard Levenshtein.
import net.davidashen.text.Hyphenator;
public class WordDistance {
public static void main(String args[]) throws Exception {
Hyphenator h = new Hyphenator();
h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
// print the result; the original call discarded the returned distance
System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
}
/**
* Calculates the syllable Levenshtein distance between two words.
* <p>
* The syllable Levenshtein distance is defined as the minimal number of
* case-insensitive syllables you have to replace, insert or delete to transform s into t.
* </p>
* @param h hyphenator used to split the words into syllables
* @param s the first word, must not be null
* @param t the second word, must not be null
* @return the syllable-level edit distance
* @throws IllegalArgumentException if either s or t is <b>null</b>
*/
public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
if (s == null || t == null)
throw new IllegalArgumentException("Strings must not be null");
// char 173 is the soft hyphen (U+00AD); the hyphenated words are split on it
final String hyphen = Character.toString((char) 173);
final String[] ss = h.hyphenate(s).split(hyphen);
final String[] st = h.hyphenate(t).split(hyphen);
final int n = ss.length;
final int m = st.length;
if (n == 0)
return m;
else if (m == 0)
return n;
int p[] = new int[n + 1]; // 'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
for (int i = 0; i <= n; i++)
p[i] = i;
for (int j = 1; j <= m; j++) {
d[0] = j;
for (int i = 1; i <= n; i++) {
int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
}
// copy current distance counts to 'previous row' distance counts
int[] _d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
return p[n];
}
}
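The single-syllable caveat mentioned above could be handled roughly as follows. This is a minimal sketch, not part of the library or of the original function: the class name, the tokenDistance and chars helpers, and the splitter parameter are all assumptions made for illustration. The splitter stands in for the hyphenator, and the demo uses a toy splitter that expects words already marked with '-' at syllable boundaries. Note that the two branches return distances in different units (characters versus syllables), so they should not be compared directly when ranking.

```java
import java.util.function.Function;

class SyllableAwareDistance {

    // Levenshtein distance over arbitrary tokens (characters or syllables),
    // compared case-insensitively, using two rolling rows.
    static int tokenDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equalsIgnoreCase(b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    // Splits a word into one-character tokens.
    static String[] chars(String s) {
        String[] out = new String[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = String.valueOf(s.charAt(i));
        }
        return out;
    }

    // Uses character-level distance when both words have a single syllable,
    // and syllable-level distance otherwise. The splitter is a stand-in for
    // a real hyphenator (hypothetical parameter).
    static int distance(Function<String, String[]> splitter, String s, String t) {
        String[] ss = splitter.apply(s);
        String[] st = splitter.apply(t);
        if (ss.length <= 1 && st.length <= 1) {
            return tokenDistance(chars(s), chars(t));
        }
        return tokenDistance(ss, st);
    }

    public static void main(String[] args) {
        // Toy splitter: assumes syllable boundaries are already marked with '-'.
        Function<String, String[]> toy = w -> w.split("-");
        System.out.println(distance(toy, "job", "jobs"));           // prints 1 (characters)
        System.out.println(distance(toy, "job-bet", "job-ba-ser")); // prints 2 (syllables)
        System.out.println(distance(toy, "ja-kob", "job"));         // prints 2 (syllables)
    }
}
```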