vbadamerau-levenshtein

Damerau-Levenshtein algorithm isn't working on short strings


I have a for loop that takes a user's input and one of the keys in my dictionary and passes them to a Damerau-Levenshtein function and based on the distance, overwrites the user's input with the dictionary key (The for loop is to cycle through each dictionary key). This works fine enough for strings larger than three characters, but if the string is three or fewer characters the algorithm returns with the wrong key. Here's the for loop:

    1950        For j = 0 To dict.Count - 1
    1960            distance = DamerauLevenshtein(SplitStr(i), dict.Keys(j))
    1970            'MsgBox dict.Keys(j) & vbCrLf & distance ' used for debugging
    1980            If distance < 4 Then
    1990                If distance < leastDist Then
    2000                    leastDist = distance
    2010                    SplitStr(i) = dict.Keys(j)
    2020                End If
    2030            End If
    2040        Next
    2050        MsgBox "The distance is: " & leastDist & vbCrLf & "The entered text was " & tempStr & vbCrLf & "The replaced word is " & SplitStr(i)

SplitStr(i) holds the user's input, which comes from a split function. I arbitrarily picked 4 for a good distance

I stole the algorithm from a bytes.com forum post. Algorithm below:

Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
  Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
  Dim intDB, intI1, intJ1, intD

  str1 = UCase(str1)
  str2 = UCase(str2)
  intLen1 = Len(str1)
  intLen2 = Len(str2)
  intTotalLen = intLen1 + intLen2
  ReDim arrStr1(intLen1)
  ReDim arrStr2(intLen2)
  ReDim arrDA(intSize)
  ReDim arrDistance(intLen1 + 2, intLen2 + 2)
  arrDistance(0, 0) = intTotalLen

  For i = 0 To intSize - 1
      arrDA(i) = 0
  Next

  For i = 0 To intLen1
      arrDistance(i + 1, 1) = i
      arrDistance(i + 1, 0) = intTotalLen
  Next

  For i = 1 To intLen1
      arrStr1(i - 1) = Asc(Mid(str1, i, 1))
  Next

  For j = 0 To intLen2
      arrDistance(1, j + 1) = j
      arrDistance(0, j + 1) = intTotalLen
  Next

  For j = 1 To intLen2
      arrStr2(j - 1) = Asc(Mid(str2, j, 1))
  Next

  For i = 1 To intLen1
      intDB = 0

      For j = 1 To intLen2
          intI1 = arrDA(arrStr2(j - 1))
          intJ1 = intDB

          If arrStr1(i - 1) = arrStr2(j - 1) Then
              intD = 0
          Else
              intD = 1
          End If

          If intD = 0 Then intDB = j

          intMini = arrDistance(i, j) + intD
          If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
          If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
          If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1

          arrDistance(i + 1, j + 1) = intMini
      Next

      arrDA(arrStr1(i - 1)) = i
  Next

  DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
End Function

If I type in "Cire" the algorithm correctly returns "CORE".

"Raman" returns "REMAN" "Cosnigned" returns "CONSIGNED

However, "Now" should return "New" but returns "OCM".

"New" also returns "OCM" (so distance should be 0, but is 2.)

"FP" should be "FP" but returns "OCM", distance is 2

"DPF" Should be "DPF" but returns "OCM", distance is 2

I just learned about the algorithm, so I'm sure I'm missing something important, but I just can't see it. Thoughts?


Solution

  • I figured it out. After much searching I found a post saying that an edit distance is commonly 2. (They didn't specify any merits on why 2 is common)

    I switched my if statement to 2 from 4 and now all of the problem terms are being corrected as they should be.