sql-server xml performance sql-optimization

Sql Server: Selective XML Index not being efficiently used

I'm exploring ways of improving the performance of an application which I can only affect on the database level to a limited degree. The SQL Server version is 2012 SP2 and the table and view structure in question is (I cannot really affect this + note that the xml document may have several hundred elements in total):

CREATE TABLE Orders(
    id nvarchar(64) NOT NULL,
    xmldoc xml NULL,
    CONSTRAINT PK_Order_id PRIMARY KEY CLUSTERED (id)
);

CREATE VIEW V_Orders as
SELECT 
    a.id, a.xmldoc
    ,a.xmldoc.value('data(/row/c1)[1]', 'nvarchar(max)') "Stuff"
    ,a.xmldoc.value('data(/row/c2)[1]', 'nvarchar(max)') "OrderType"
etc..... many columns
from Orders a;

A typical query (and the one being used for testing below):

SELECT id FROM V_Orders WHERE OrderType = '30791'

All the queries are performed against the view and I can affect neither the queries nor the table/view structure.

I thought adding a selective XML index to the table would be my saviour:

CREATE SELECTIVE XML INDEX I_Orders_OrderType ON Orders(xmldoc)
FOR(
    pathOrderType = '/row/c2' as SQL [nvarchar](20)
)

But even after updating the statistics the execution plan is looking weird. Couldn't post a pic as new account so the relevant details as text:

Clustered index seek from selectiveXml (Cost: 2% of total). Expected number of rows 1 but expected number of execution times 1269 (number of rows in the table)
-> Top N sort (Cost: 95% of total)
-> Compute scalar (Cost 0)
Separate branch: Clustered index scan PK_Order_id (Cost: 3% of total). Expected number of rows 1269
-> Merged to the Computer scalar results with Nested loops (Left outer join)
-> Filter
-> Final result (Expected number of rows 1269)

In actuality with my test data the query doesn't even return any results but whether it returns one or few doesn't make any difference. Execution times support the query really taking as long as could be deduced from the execution plan and have read counts in the thousands.

So my question is why is the selective xml index not being used properly by the optimizer? Or have I got something wrong? How would I optimize this specific query's performance with selective xml indexing (or perhaps persisted column)?

Edit: I did additional testing with larger sample data (~274k rows in the table with XML documents close to average production sizes) and compared the selective XML index to a promoted column. The results are from Profiler trace, concentrating on CPU usage and read counts. The execution plan for selective xml indexing is basically identical to what is described above.

Selective XML index and 274k rows (executing the query above): CPU: 6454, reads: 938521

After I updated the values in the searched field to be unique (total records still 274k) I got the following results:

Selective XML index and 274k rows (executing the query above): CPU: 10077, reads: 1006466

Then using a promoted (i.e. persisted) separately indexed column and using it directly in the view: CPU: 0, reads: 23

Selective XML index performance seems to be closer to full table scan than proper SQL indexed column fetch. I read somewhere that using schema for the table might help drop the TOP N step from execution plan (assuming we're searching for a non-repeating field) but I'm not sure whether that's a realistic possibility in this case.

Solution

The selective XML index you create is stored in an internal table with the primary key from Orders as the leading column for the clustered key for the internal table and the paths specified stored as sparse columns.

The query plan you get probably looks a something like this:

enter image description here

You have a scan over the entire Orders table with a seek in the internal table on the primary key for each row in Orders. The final Filter operator is responsible for checking the value of OrderType returning only the matching rows.

Not really what you would expect from something called an index.

To the rescue comes a secondary selective XML index. They are created for one of the paths specified in the primary selective index and will create a non-clustered key on the values extracted in the path expression.

It is however not all that easy. SQL Server will not use the secondary index on predicates used on values extracted by the values() function. You have to use exists() instead. Also, exists() requires the use of XQUERY data types in the path expressions where value() uses SQL data types.

Your primary selective XML index could look like this:

CREATE SELECTIVE XML INDEX I_Orders_OrderType ON Orders(xmldoc)
FOR 
(
  pathOrderType = '/row/c2' as sql nvarchar(20), 
  pathOrderTypeX = '/row/c2/text()' as xquery 'xs:string' maxlength (20)
)

With a secondary on pathOrderTypeX.

CREATE XML INDEX I_Orders_OrderType2 ON Orders(xmldoc)
  USING XML INDEX I_Orders_OrderType FOR (pathOrderTypeX)

And with a query that uses exist() you will get this plan.

select id
from V_Orders
where xmldoc.exist('/row/c2/text()[. = "30791"]') = 1

enter image description here

The first seek is a seek for the value you are looking for in the non-clustered index of the internal table. The key lookup is done on the clustered key on the internal table (don't know why that is necessary). And the last seek is on primary key in the Orders table followed by a filter that checks for null values in the column xmldoc.

If you can get away with using property promotion, creating calculated indexed columns in the Orders table from the XML, I guess you would still get better performance than using secondary selective XML indexes.