sqlsql-serverlevenshtein-distance

Levenshtein Distance but what to use for CONV, HEX, UNHEX for Characters


I am trying to convert this Levenshtein Distance algorithm from MySQL to SQL Server.

I am hung up on the CONCAT(@cv1, UNHEX(HEX(@j))) and CONV(HEX(SUBSTRING(@cv1, @j, 1)), 16, 10) as I don't know equivalent functions in SQL Server for HEX(), UNHEX() and CONV(). I THINK I can use CONVERT() but I am not entirely sure how.

Here is what I have:

CREATE FUNCTION fn_FuzzyMatch(
        @str1   varchar(max),
        @str2   varchar(max) )
    RETURNS int
AS
BEGIN
    DECLARE 
        @str1_len   int, 
        @str2_len   int, 
        @i          int, 
        @j          int, 
        @c          int, 
        @c_temp     int, 
        @cost       int,
        @str1_char  char,
        @cv0        varbinary(max), 
        @cv1        varbinary(max);

    SELECT
        @str1_len = SDU_Tools.StringLength(@str1),
        @str2_len = SDU_Tools.StringLength(@str2),
        @cv1 = 0x00, 
        @j = 1, 
        @i = 1, 
        @c = 0;

    IF @str1 = @str2
        RETURN 0;
    ELSE IF @str1_len = 0
        RETURN @str2_len;
    ELSE IF @str2_len = 0
        RETURN @str1_len;
    ELSE
    BEGIN
        WHILE @j <= @str2_len
        BEGIN
            SET @cv1 = CONCAT(@cv1, UNHEX(HEX(@j)));
            SET @j = @j + 1;
        END; 

        WHILE @i <= @str1_len
        BEGIN
            SELECT
                @str1_char = SUBSTRING(@str1, @i, 1),
                @c = @i, 
                @cv0 = UNHEX(HEX(@i)),
                @j = 1;

            WHILE @j <= @str2_len
            BEGIN
                SET @c = @c + 1;
                IF @str1_char = SUBSTRING(@str2, @j, 1)
                    SET @cost = 0; 
                ELSE
                    SET @cost = 1;

                SET @c_temp = CONV(HEX(SUBSTRING(@cv1, @j, 1)), 16, 10) + @cost;
                IF @c > @c_temp
                    SET @c = @c_temp;
                SET @c_temp = CONV(HEX(SUBSTRING(@cv1, @j+1, 1)), 16, 10) + 1;
                IF @c > @c_temp
                    SET @c = @c_temp;
                SET @cv0 = CONCAT(@cv0, UNHEX(HEX(@c)))
                SET @j = @j + 1;
            END;
            SET @cv1 = @cv0
            SET @i = @i + 1;
        END;
    END;
    
    RETURN @c;
END

Solution

  • Scroll down the addendum below.

    Original answer:

    The MySQL combination UNHEX(HEX(integer-value)) is converting a small integer ASCII code to an ASCII character. The SQL Server equivalent is CHAR(integer-value). This would convert a 0 to a NUL control character, 1, to a SOH, and 55 to the ASCII digit "7".

    The MySQL combination CONV(HEX(character-value), 16, 10) is doing the reverse, converting an ASCII character to the ASCII character code. The SQL Server equivalent is ASCII(character-value). This would convert a NUL control character to a 0, a SOH to a 1, and the ASCII digit '7' to 55.

    It looks like your function is using these operations to map integers to and from a character string that simulates an array or integers. Because these values seem to be related to input string lengths, the algorithm is likely limited to strings no longer than 255 characters. I haven't analyzed their usage beyond that.

    See the following: fiddle1 and fiddle2.

    Addendum:

    After another look, I see that the MySQL code is also performing implicit conversions between char and binary. My revised answer follows.

    The MySQL combination UNHEX(HEX(integer-value)) is converting a small integer ASCII code to an ASCII character that is then implicitly converted to a binary value. The SQL Server equivalent is CONVERT(BINARY(1), integer-value). This would convert a 0 to a 0x00, a 1 to a 0x01, and 255 to a 0xFF.

    The MySQL combination CONV(HEX(binary-value), 16, 10) is doing the reverse, implicitly converting the binary to a character, and then that character to the ASCII character code. The SQL Server equivalent is CONVERT(INT, binary(1)-value). This would convert a 0x00 to a 0, a 0x01 to a 1, and a 0xff to a 255.

    It looks like your function is using these operations to map integers to and from a binary string that simulates an array or integers. Because these values are limited to 255 and seem to be related to input string lengths, the algorithm is likely limited to strings no longer than 255 characters.

    See this db<>fiddle that applies both versions of the above substitutions (and a few minimal other changes) to your function. It also includes two of the solutions from the linked Levenshtein distance in T-SQL question identified by DaleK in the comments.

    Sample results:

    String1 String2 fn_FuzzyMatch1 fn_FuzzyMatch2 Levenshtein edit_distance
    String1 String2 1 1 1 1
    Gigantic Titanic 3 3 3 3
    Apples Oranges 5 5 5 5