Essentially, my question boils down to the difference between two query plans I'm observing (this is on SQL Server 2019).
I defined a scalar function to parse out a string representation of an IP address into the `binary(16)` value of the IPv6 address (mapping IPv4 addresses over to IPv6). I first implemented it in SQL, but then I also implemented it in C# using the built-in `IPAddress` class to parse the value. I'm attempting to use this function to join a table containing IP address strings to a table with a list of parsed-out CIDR blocks (clustered index on the `[Start]` and `[End]` `binary(16)` columns).
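For context, the two tables look roughly like this (simplified; the real tables have additional columns, and the exact types and index definition here are approximations):

    CREATE TABLE [dbo].[values]
    (
        [IpAddress] nvarchar(50) NOT NULL
    );

    CREATE TABLE [dbo].[cidr]
    (
        [Start] binary(16) NOT NULL,
        [End]   binary(16) NOT NULL
    );

    CREATE CLUSTERED INDEX [CIX_cidr] ON [dbo].[cidr] ([Start], [End]);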
There were two ways that I wrote the SQL query:
1. Calling the function directly in the `JOIN` criteria:

        SELECT
            *
        FROM
            [dbo].[values] val
            LEFT JOIN [dbo].[cidr] cidr
                ON [dbo].[fn_ParseIP](val.[IpAddress]) BETWEEN cidr.[Start] AND cidr.[End]
        ;
2. Calculating the value in an `APPLY` clause before referencing it in the `JOIN`:

        SELECT
            val.*, cidr.*
        FROM
            [dbo].[values] val
            CROSS APPLY
            (
                SELECT [ParsedIpAddress] = [dbo].[fn_ParseIP](val.[IpAddress])
            ) calc
            LEFT JOIN [dbo].[cidr] cidr
                ON calc.[ParsedIpAddress] BETWEEN cidr.[Start] AND cidr.[End]
        ;
In my testing, `[dbo].[values]` contained 17 rows (and was defined using `VALUES` rather than being an actual table) and `[dbo].[cidr]` contained 986,320 rows.
When using a scalar function defined in SQL, query 1 takes about 7.5 minutes to run and query 2 takes under 1 second.
When using a CLR scalar function, both queries take about 2.5 minutes to run, but query 2 has an extra node in the query plan to compute the scalar function after the join.
The difference is ultimately that when referencing the scalar function that was defined in SQL, I'm able to get it to generate the first plan, where it computes the result of the scalar function first and then uses that as the seek predicate into the `[dbo].[cidr]` clustered index when performing the join. But when using the CLR function, it always performs the calculation as part of the clustered index seek (and filter), so it runs the function significantly more often to get the results.
My assumption is that this behavior could be due to the query planner believing the function to be nondeterministic, but I have the CLR function implemented with the following attributes:
    [SqlFunction(
        DataAccess = DataAccessKind.None,
        IsDeterministic = true,
        IsPrecise = true
    )]
    [return: SqlFacet(IsFixedLength = true, IsNullable = true, MaxSize = 16)]
    public static SqlBinary fn_CLR_ParseIP([SqlFacet(MaxSize = 50)] SqlString ipAddress)
    { }
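For completeness, the T-SQL side of the registration looks roughly like this (the assembly and class names below are placeholders, not the actual ones):

    CREATE FUNCTION [dbo].[fn_CLR_ParseIP] (@ipAddress nvarchar(50))
    RETURNS binary(16)
    WITH RETURNS NULL ON NULL INPUT
    AS EXTERNAL NAME [MyClrAssembly].[UserDefinedFunctions].[fn_CLR_ParseIP];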
My hope was that I could rely on the .NET standard library to deal with the IP address parsing for me in SQL. Currently, we have some processes that work with IPv4 addresses only, and I need to update them to work with IPv6 as well. In some of our large databases, this processing is very slow, so I was hoping that the parsing logic in .NET would be more efficient. It seems like the CLR function itself is faster than my SQL implementation, but the effect on the query plan is significantly worse.
I can likely rewrite the query to use temporary tables to parse the IP addresses out first, and that should resolve this issue. I can also get decent results when defining an equivalent CLR table-valued function that returns just a single row.
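For illustration, the temporary-table approach I have in mind is roughly this (a sketch, not the exact production query):

    -- Parse each IP address string exactly once into a temp table...
    SELECT
        val.*,
        [ParsedIpAddress] = [dbo].[fn_CLR_ParseIP](val.[IpAddress])
    INTO #parsed
    FROM [dbo].[values] val;

    -- ...then join on the already-computed binary(16) value.
    SELECT *
    FROM #parsed p
    LEFT JOIN [dbo].[cidr] cidr
        ON p.[ParsedIpAddress] BETWEEN cidr.[Start] AND cidr.[End];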
However, I'd like to know if there is something that I'm missing that would make it easier to use a CLR scalar function as part of a filter predicate. Is it just a bad idea and I should proceed with some of these alternatives, or is there something that I could do that would make it easier to work with the CLR function as a drop-in replacement for the SQL function?
For anybody that is interested, here is the final query that is performing well using the concept given in T N's answer.
    SELECT
        *
    FROM
        [dbo].[values] val
        CROSS APPLY
        (
            SELECT [IpAddress] = [dbo].[fn_CLR_ParseIP](val.[IpAddress])
        ) parsed
        OUTER APPLY
        (
            SELECT TOP (1)
                *
            FROM
                [dbo].[cidr] _cidr
            WHERE
                _cidr.[range_start] <= parsed.[IpAddress]
                AND _cidr.[range_end] >= parsed.[IpAddress]
            ORDER BY
                _cidr.[range_start] DESC
        ) cidr
    ;
If your `cidr` table has rows with no overlapping `Start`/`End` ranges, so that for any given IP address there will be at most one matching row, you can:

- Use `SELECT TOP 1 ... ORDER BY ...` logic to efficiently identify a single candidate row based on the `Start` column.
- Check that candidate row against the `End` column value.
- Do this in an `OUTER APPLY`, with the `TOP 1` further wrapped up inside a nested subselect. The `WHERE [End] ...` test must be performed after the `TOP 1` operation to prevent scanning of additional rows for the case where the (first) candidate row does not match.

    OUTER APPLY (
        SELECT candidate.*
        FROM (
            SELECT TOP 1 *
            FROM cidr
            WHERE cidr.Start <= calc.ParsedIpAddress
            ORDER BY cidr.Start DESC
        ) candidate
        WHERE candidate.[End] >= calc.ParsedIpAddress
    ) match
An index on either `cidr(Start)`, `cidr(Start, [End])`, or `cidr(Start) INCLUDE([End])` would be needed for efficient operation.
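For example, a minimal sketch of the last variant (the index name is arbitrary):

    CREATE NONCLUSTERED INDEX IX_cidr_Start
        ON cidr ([Start])
        INCLUDE ([End]);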
If your source `cidr` table can have arbitrarily overlapping ranges, such that a given IP can potentially have multiple matches, then I know of no simple & efficient lookup method that avoids a range scan. An index on `cidr(Start, [End])` would at least perform the range scan on a limited-width (more compact) index rather than on the presumably wider table (clustered index). That might provide some performance improvement.
A more involved solution might be to segregate/partition the `cidr` table based on the number of fixed octets (like the original/obsolete class A, B, C, D subnets) or the number of fixed bits (/16, /24, etc.) - up to 16 octets or 128 bits for IPv6. You would then optimize the lookups into each partition and UNION the results. This would almost simulate the way SQL Server implements spatial indexes.
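A loose sketch of that idea, assuming the data were split into hypothetical per-prefix-length tables such as `cidr_p16` and `cidr_p24` (with `@ip` standing in for the already-parsed address). Since ranges that share a prefix length cannot overlap, each branch can reuse the single-candidate `TOP 1` lookup from above, and `UNION ALL` combines the partitions:

    SELECT p16.*
    FROM (
        SELECT TOP (1) *
        FROM cidr_p16 c
        WHERE c.[Start] <= @ip
        ORDER BY c.[Start] DESC
    ) p16
    WHERE p16.[End] >= @ip
    UNION ALL
    SELECT p24.*
    FROM (
        SELECT TOP (1) *
        FROM cidr_p24 c
        WHERE c.[Start] <= @ip
        ORDER BY c.[Start] DESC
    ) p24
    WHERE p24.[End] >= @ip;
    -- ...one branch per prefix length present in the data...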