sql-serverperformancesql-server-2008-r2sqlxmlcross-apply

cross apply xml query performs exponentially worse as xml document grows


What I Have

I have a variable size XML document that needs to be parsed on MSSQL 2008 R2 that looks like this:

<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
  <item name="1">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
  </item>
  <item name="2">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
  </item>
</data>

.

What I Want

I need to transform it into a regular table type dataset that looks like this:

item_name field_id                             field_type  field_value
--------- ------------------------------------ ----------- ---------------
1         EA032B25-19F1-4C1B-BDDE-3113542D13A5 2           0.5065430097062
1         71014ACB-571B-4C72-9C9B-05458B11335F 2           -0.795004023461
1         740C36E9-1988-413E-A1D5-B3E5B4405B45 2           0.0152649050024
2         EA032B25-19F1-4C1B-BDDE-3113542D13A5 2           0.3660968028040
2         71014ACB-571B-4C72-9C9B-05458B11335F 2           -0.386642801354
2         740C36E9-1988-413E-A1D5-B3E5B4405B45 2           0.0316711741841
3         EA032B25-19F1-4C1B-BDDE-3113542D13A5 2           0.8839620369590
3         71014ACB-571B-4C72-9C9B-05458B11335F 2           -0.781459993268
3         740C36E9-1988-413E-A1D5-B3E5B4405B45 2           0.2284423515729

.

What Works

This cross apply query creates the desired output:

create table #temp (x xml)

insert into #temp (x)
values ('
<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
  <item name="1">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
  </item>
  <item name="2">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
  </item>
  <item name="3">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.883962036959074</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.781459993268713</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.228442351572923</field>
  </item>
</data>
')

select c.value('(../@name)','varchar(5)') as item_name
      ,c.value('(@id)','uniqueidentifier') as field_id
      ,c.value('(@type)','int') as field_type
      ,c.value('(.)','nvarchar(15)') as field_value
from   #temp cross apply
       #temp.x.nodes('/data/item/field') as y(c)

drop table #temp

.

Problem

When there are a few hundred (or fewer) <item> elements in the XML, the query performs just fine. However, when there are 1,000 <item> elements, it takes 24 seconds to finish returning the rows in SSMS. When there are 6,500 <item> elements, it takes about 20 minutes to run the cross apply query. We could have 10-20,000 <item> elements.

.

Questions

What makes the cross apply query perform so poorly on this simple XML document, and perform exponentially slower as the dataset grows?

Is there a more efficient way to transform the XML document into the tabular dataset (in SQL)?


Solution

  • What makes the cross apply query perform so poorly on this simple XML document, and perform exponentially slower as the dataset grows?

    It Is the use of the parent axis to get the attribute ID from the item node.

    It is this part of the query plan that is problematic.

    enter image description here

    Notice the 423 rows coming out of the lower Table-valued function.

    Adding just one more item node with three field nodes gives you this.

    enter image description here

    732 rows returned.

    What if we double the nodes from the first query to a total of 6 item nodes?

    enter image description here

    We are up to a whopping 1602 row returned.

    The figure 18 in the top function is all field nodes in your XML. We have here 6 items with three fields in each item. Those 18 nodes are used in a nested loops join against the other function so 18 executions returning 1602 rows gives that it is returning 89 rows per iteration. That just happens to be the exact number of nodes in the entire XML. Well it is actually one more than all the visible nodes. I don't know why. You can use this query to check the total number of nodes in your XML.

    select count(*)
    from @XML.nodes('//*, //@*, //*/text()') as T(X)  
    

    So the algorithm used by SQL Server to get the value when you use the parent axis .. in a values function is that it first finds all the nodes you are shredding on, 18 in the last case. For each of those nodes it shreds and returns the entire XML document and checks in the filter operator for the node you actually want. There you have your exponential growth. Instead of using the parent axis you should use one extra cross apply. First shred on item and then on field.

    select I.X.value('@name', 'varchar(5)') as item_name,
           F.X.value('@id', 'uniqueidentifier') as field_id,
           F.X.value('@type', 'int') as field_type,
           F.X.value('text()[1]', 'nvarchar(15)') as field_value
    from #temp as T
      cross apply T.x.nodes('/data/item') as I(X)
      cross apply I.X.nodes('field') as F(X)
    

    I also changed how you access the text value of the field. Using . will make SQL Server go look for child nodes to field and concatenate those values in the result. You have no child values so the result is the same but it is a good thing to avoid to have that part in the query plan (the UDX operator).

    The query plan does not have the issue with the parent axis if you are using an XML index but you will still benefit from changing how you fetch the field value.