vb6hash-code-uniquenesshash-function

Why are the hash codes generated by this function not unique?


I'm testing the VB function below that I got from a Google search. I plan to use it to generate hash codes for quick string comparison. However, there are occasions in which two different strings have the same hash code. For example, these strings

"122Gen 1 heap size (.NET CLR Memory w3wp):mccsmtpteweb025.20833333333333E-02"

"122Gen 2 heap size (.NET CLR Memory w3wp):mccsmtpteweb015.20833333333333E-02"

have the same hash code of 237117279.

Please tell me: - What is wrong with the function? - How can I fix it?

Thank you

martin


Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (dest As Any, src As Any, ByVal bytes As Long)

Private Function HashCode(Key As String) As Long
  On Error GoTo ErrorGoTo

  Dim lastEl As Long, i As Long
  ' copy ansi codes into an array of long'
  lastEl = (Len(Key) - 1) \ 4
  ReDim codes(lastEl) As Long
  ' this also converts from Unicode to ANSI'
  CopyMemory codes(0), ByVal Key, Len(Key)
  ' XOR the ANSI codes of all characters'

  For i = 0 To lastEl - 1
    HashCode = HashCode Xor codes(i) 'Xor'
  Next

ErrorGoTo:
  Exit Function
End Function

Solution

  • I'm betting there are more than just "occasions" when two strings generate the same hash using your function. In fact, it probably happens more often than you think.

    A few things to realize:

    First, there will be hash collisions. It happens. Even with really, really big spaces like MD5 (128 bits) there are still two strings that can generate the same resulting hash. You have to deal with those collisions by creating buckets.

    Second, a long integer isn't really a big hash space. You're going to get more collisions than you would if you used more bits.

    Thirdly, there are libraries available to you in Visual Basic (like .NET's System.Security.Cryptography namespace) that will do a much better job of hashing than most mere mortals.