I have a complex SQL Server query that needs to process around 2000 parameters at a time. For simplicity, suppose we have:
select *
from sample_table
where id in (:listIds);
In the Java code, I bind listIds using Hibernate's Query.setParameterList("listIds", listIds), where listIds is of type List<String> and has around 2000 entries. This query takes too long and sometimes times out.
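For concreteness, a minimal sketch of what that first approach looks like, assuming a Hibernate Session and a mapped SampleTable entity (both names are illustrative):

List<String> listIds = loadIds(); // ~2000 ids; loadIds() is a placeholder
List<SampleTable> results = session
        .createNativeQuery("select * from sample_table where id in (:listIds)", SampleTable.class)
        .setParameterList("listIds", listIds)
        .getResultList();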
I found this answer, where the author uses openjson
and mentions improved performance. I followed a similar pattern:
select *
from sample_table
where id in (select cast(value as NVARCHAR(20)) from openjson(:listIds));
In the Java code, I convert the List<String> listIds into a JSON array of strings of length 2000 (e.g. ["a", "b", "c"]), then bind it with .setParameter("listIds", listIds). This query finishes much faster than the previous approach.
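Here is a minimal sketch of that pattern, assuming Jackson is used to build the JSON array (any serializer that produces ["a","b","c"] would do):

ObjectMapper mapper = new ObjectMapper(); // com.fasterxml.jackson.databind; writeValueAsString throws JsonProcessingException
String jsonIds = mapper.writeValueAsString(listIds); // e.g. ["a","b","c"]
List<SampleTable> results = session
        .createNativeQuery(
                "select * from sample_table "
              + "where id in (select cast(value as nvarchar(20)) from openjson(:listIds))",
                SampleTable.class)
        .setParameter("listIds", jsonIds)
        .getResultList();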
My question is: why does the latter approach improve performance so much? I tried searching for explanations of openjson vs. .setParameterList(), but I'm still unclear on why performance improved and was hoping someone could break it down for me.
There are several approaches to querying a large list of id values:
... WHERE id in (1,2,3,4,...39203) -- explicit list of values
... WHERE id in (@p1, @p2, @p3... @p2000) -- list of parameters
... WHERE id in (select ... FROM @tableValuedParameter) -- single table-valued parameter
... WHERE id in (select ... FROM STRING_SPLIT(@ids,...)) -- single varchar "csv" parameter
... WHERE id in (SELECT ... FROM OPENJSON(@ids,...)) -- single varchar JSON parameter
The first approach has the drawback of requiring a parse and compile of a new plan for every query execution. These are large plans. This takes time, causes plan cache bloat, wastes memory and will make your neighborhood DBA sad.
The second approach (the default behavior for Java/JPA) is marginally better. It requires a parse and compile for each distinct number of parameters, so two queries with the same number of parameters can share the same plan. The downside is that the amount of data sent to the server increases: the query text is large, the parameter definition list is large, and the parameter value list is large as well. Depending on the query, this can double or triple the size of the data sent to SQL Server. There is also a hard limit of 2100 parameters. Less sad DBA.
The third approach uses a table valued parameter. This is Microsoft's "approved" way to send multiple values to SQL server. The advantage is the query text stays constant regardless of the number of values, so there is only one parse/compile and one query plan that gets used. Supports an arbitrary number of values (100k+ is no issue). The downside is that table valued parameters are stored in tempdb, so potential for a whole new class of headaches in a high volume environment. Better performance than the first two approaches over a certain number of values (~500 in my testing, YMMV). Worse performance for a small number of values (~<50) due to tempdb allocations. DBA will be either fine or really not fine with this approach depending on tempdb contention.
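A minimal sketch of the table-valued-parameter route with the Microsoft JDBC driver (mssql-jdbc); the type name dbo.IdList and the plain-JDBC plumbing are assumptions, since Hibernate has no portable TVP API:

// One-time server-side setup (type name is illustrative):
//   CREATE TYPE dbo.IdList AS TABLE (id NVARCHAR(20));
SQLServerDataTable tvp = new SQLServerDataTable(); // com.microsoft.sqlserver.jdbc
tvp.addColumnMetadata("id", java.sql.Types.NVARCHAR);
for (String id : listIds) {
    tvp.addRow(id);
}
try (PreparedStatement ps = connection.prepareStatement(
        "select * from sample_table where id in (select id from ?)")) {
    ps.unwrap(SQLServerPreparedStatement.class).setStructured(1, "dbo.IdList", tvp);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map each row as needed
        }
    }
}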
The last two approaches have effectively the same performance characteristics. The query text is constant, so plan re-use is great. The query plan is small. Tempdb isn't involved, so no concerns there. Both OPENJSON() and STRING_SPLIT() are very efficiently implemented in C++. Performance meets or exceeds the other approaches for small numbers of values and is substantially better as the number of values grows (>250 values).
STRING_SPLIT() is a good choice for "simple" data types, e.g. a list of ints.
OPENJSON() is a better choice if you need to support ints, strings, dates, etc., or to pass in complex objects.
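For completeness, a minimal sketch of the STRING_SPLIT() variant for a list of numeric ids, assuming SQL Server 2016+ (compatibility level 130, which both STRING_SPLIT and OPENJSON require) and id values that never contain commas:

String csvIds = String.join(",", listIds); // e.g. "1,2,3" for numeric ids held as strings
List<SampleTable> results = session
        .createNativeQuery(
                "select * from sample_table "
              + "where id in (select cast(value as int) from string_split(:ids, ','))",
                SampleTable.class)
        .setParameter("ids", csvIds)
        .getResultList();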