I am trying to convert this Levenshtein Distance algorithm from MySQL to SQL Server.
I am hung up on the CONCAT(@cv1, UNHEX(HEX(@j)))
and CONV(HEX(SUBSTRING(@cv1, @j, 1)), 16, 10)
as I don't know equivalent functions in SQL Server for HEX(), UNHEX() and CONV(). I THINK I can use CONVERT()
but I am not entirely sure how.
Here is what I have:
CREATE FUNCTION fn_FuzzyMatch(
@str1 varchar(max),
@str2 varchar(max) )
RETURNS int
AS
BEGIN
DECLARE
@str1_len int,
@str2_len int,
@i int,
@j int,
@c int,
@c_temp int,
@cost int,
@str1_char char,
@cv0 varbinary(max),
@cv1 varbinary(max);
SELECT
@str1_len = SDU_Tools.StringLength(@str1),
@str2_len = SDU_Tools.StringLength(@str2),
@cv1 = 0x00,
@j = 1,
@i = 1,
@c = 0;
IF @str1 = @str2
RETURN 0;
ELSE IF @str1_len = 0
RETURN @str2_len;
ELSE IF @str2_len = 0
RETURN @str1_len;
ELSE
BEGIN
WHILE @j <= @str2_len
BEGIN
SET @cv1 = CONCAT(@cv1, UNHEX(HEX(@j)));
SET @j = @j + 1;
END;
WHILE @i <= @str1_len
BEGIN
SELECT
@str1_char = SUBSTRING(@str1, @i, 1),
@c = @i,
@cv0 = UNHEX(HEX(@i)),
@j = 1;
WHILE @j <= @str2_len
BEGIN
SET @c = @c + 1;
IF @str1_char = SUBSTRING(@str2, @j, 1)
SET @cost = 0;
ELSE
SET @cost = 1;
SET @c_temp = CONV(HEX(SUBSTRING(@cv1, @j, 1)), 16, 10) + @cost;
IF @c > @c_temp
SET @c = @c_temp;
SET @c_temp = CONV(HEX(SUBSTRING(@cv1, @j+1, 1)), 16, 10) + 1;
IF @c > @c_temp
SET @c = @c_temp;
SET @cv0 = CONCAT(@cv0, UNHEX(HEX(@c)))
SET @j = @j + 1;
END;
SET @cv1 = @cv0
SET @i = @i + 1;
END;
END;
RETURN @c;
END
Scroll down the addendum below.
Original answer:
The MySQL combination UNHEX(HEX(integer-value))
is converting a small integer ASCII code to an ASCII character. The SQL Server equivalent is CHAR(integer-value)
. This would convert a 0 to a NUL control character, 1, to a SOH, and 55 to the ASCII digit "7".
The MySQL combination CONV(HEX(character-value), 16, 10)
is doing the reverse, converting an ASCII character to the ASCII character code. The SQL Server equivalent is ASCII(character-value)
. This would convert a NUL control character to a 0, a SOH to a 1, and the ASCII digit '7' to 55.
It looks like your function is using these operations to map integers to and from a character string that simulates an array or integers. Because these values seem to be related to input string lengths, the algorithm is likely limited to strings no longer than 255 characters. I haven't analyzed their usage beyond that.
See the following: fiddle1 and fiddle2.
Addendum:
After another look, I see that the MySQL code is also performing implicit conversions between char and binary. My revised answer follows.
The MySQL combination UNHEX(HEX(integer-value))
is converting a small integer ASCII code to an ASCII character that is then implicitly converted to a binary value. The SQL Server equivalent is CONVERT(BINARY(1), integer-value)
. This would convert a 0 to a 0x00, a 1 to a 0x01, and 255 to a 0xFF.
The MySQL combination CONV(HEX(binary-value), 16, 10)
is doing the reverse, implicitly converting the binary to a character, and then that character to the ASCII character code. The SQL Server equivalent is CONVERT(INT, binary(1)-value)
. This would convert a 0x00 to a 0, a 0x01 to a 1, and a 0xff to a 255.
It looks like your function is using these operations to map integers to and from a binary string that simulates an array or integers. Because these values are limited to 255 and seem to be related to input string lengths, the algorithm is likely limited to strings no longer than 255 characters.
See this db<>fiddle that applies both versions of the above substitutions (and a few minimal other changes) to your function. It also includes two of the solutions from the linked Levenshtein distance in T-SQL question identified by DaleK in the comments.
Sample results:
String1 | String2 | fn_FuzzyMatch1 | fn_FuzzyMatch2 | Levenshtein | edit_distance |
---|---|---|---|---|---|
String1 | String2 | 1 | 1 | 1 | 1 |
Gigantic | Titanic | 3 | 3 | 3 | 3 |
Apples | Oranges | 5 | 5 | 5 | 5 |