CROSS APPLY XML query performs exponentially worse as XML document grows
What I Have

I have a variable-size XML document that needs to be parsed on SQL Server 2008 R2. It looks like this:

<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
  <item name="1">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
  </item>
  <item name="2">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
  </item>
</data>


What I Want

I need to transform it into a regular table type dataset that looks like this:

item_name field_id                             field_type  field_value
--------- ------------------------------------ ----------- ---------------
1         EA032B25-19F1-4C1B-BDDE-3113542D13A5 2           0.5065430097062
1         71014ACB-571B-4C72-9C9B-05458B11335F 2           -0.795004023461
1         740C36E9-1988-413E-A1D5-B3E5B4405B45 2           0.0152649050024
2         EA032B25-19F1-4C1B-BDDE-3113542D13A5 2           0.3660968028040
2         71014ACB-571B-4C72-9C9B-05458B11335F 2           -0.386642801354
2         740C36E9-1988-413E-A1D5-B3E5B4405B45 2           0.0316711741841
3         EA032B25-19F1-4C1B-BDDE-3113542D13A5 2           0.8839620369590
3         71014ACB-571B-4C72-9C9B-05458B11335F 2           -0.781459993268
3         740C36E9-1988-413E-A1D5-B3E5B4405B45 2           0.2284423515729


What Works

This cross apply query creates the desired output:

create table #temp (x xml)

insert into #temp (x)
values ('
<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
  <item name="1">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
  </item>
  <item name="2">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
  </item>
  <item name="3">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.883962036959074</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.781459993268713</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.228442351572923</field>
  </item>
</data>
')

select c.value('(../@name)','varchar(5)') as item_name
      ,c.value('(@id)','uniqueidentifier') as field_id
      ,c.value('(@type)','int') as field_type
      ,c.value('(.)','nvarchar(15)') as field_value
from   #temp cross apply
       #temp.x.nodes('/data/item/field') as y(c)

drop table #temp


Problem

When there are a few hundred (or fewer) <item> elements in the XML, the query performs just fine. However, with 1,000 <item> elements it takes 24 seconds to return the rows in SSMS, and with 6,500 <item> elements the cross apply query takes about 20 minutes. We could have 10,000-20,000 <item> elements.
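To reproduce the timings at different sizes, a test document with an arbitrary number of <item> elements can be generated like this (a sketch: the field values are dummies, and it assumes the self cross join of sys.all_objects yields enough rows, which it easily does for 20,000):

```sql
-- Sketch: build a test document with @n <item> elements (3 <field> each)
declare @n int = 1000
declare @items nvarchar(max) = N''

select @items = @items
     + N'<item name="' + cast(T.i as nvarchar(10)) + N'">'
     + N'<field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.5</field>'
     + N'<field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.7</field>'
     + N'<field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.1</field>'
     + N'</item>'
from (select top (@n) row_number() over (order by (select null)) as i
      from sys.all_objects a cross join sys.all_objects b) as T

-- Wrap the items in the root element and cast to xml for testing
declare @doc xml =
    cast(N'<data item_id_type="1" has_data_event="False">' + @items + N'</data>' as xml)
```

The resulting @doc can then be inserted into the #temp table to time the query at whatever size you like.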


Questions

What makes the cross apply query perform so poorly on this simple XML document, and perform exponentially slower as the dataset grows?

Is there a more efficient way to transform the XML document into the tabular dataset (in SQL)?

Intellectualism answered 13/6, 2014 at 1:44 Comment(0)

What makes the cross apply query perform so poorly on this simple XML document, and perform exponentially slower as the dataset grows?

It is the use of the parent axis (..) to get the name attribute from the parent item node.

It is this part of the query plan that is problematic.

[query plan screenshot: nested loops join between two table-valued functions]

Notice the 423 rows coming out of the lower Table-valued function.

Adding just one more item node with three field nodes gives you this.

[query plan screenshot]

732 rows returned.

What if we double the nodes from the first query to a total of 6 item nodes?

[query plan screenshot]

We are up to a whopping 1,602 rows returned.

The figure 18 in the top function is the number of field nodes in your XML: here we have six item nodes with three field nodes each. Those 18 nodes drive a nested loops join against the other table-valued function, so 18 executions returning 1,602 rows in total means 89 rows per iteration. That happens to be exactly the number of nodes in the entire XML document (well, actually one more than all the visible nodes; I don't know why). You can use this query to check the total number of nodes in your XML.

select count(*)
from   #temp cross apply
       #temp.x.nodes('//*, //@*, //*/text()') as T(X)

So the algorithm SQL Server uses to get a value through the parent axis (..) in the value() function is: it first finds all the nodes you are shredding on (18 in the last case), and then, for each of those nodes, it shreds the entire XML document again and uses a Filter operator to pick out the one node you actually want. That is the source of the non-linear growth: the work is roughly (number of shredded nodes) x (total nodes in the document). Instead of using the parent axis you should use one extra cross apply: first shred on item, then on field.

select I.X.value('@name', 'varchar(5)') as item_name,
       F.X.value('@id', 'uniqueidentifier') as field_id,
       F.X.value('@type', 'int') as field_type,
       F.X.value('text()[1]', 'nvarchar(15)') as field_value
from #temp as T
  cross apply T.x.nodes('/data/item') as I(X)
  cross apply I.X.nodes('field') as F(X)

I also changed how you access the text value of the field. Using . makes SQL Server look for child nodes of field and concatenate their values into the result. You have no child nodes, so the result is the same, but it is good to keep that part (the UDX operator) out of the query plan.
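A minimal sketch of the difference (the <note> child element here is invented purely for illustration): . returns the concatenated string value of the element and all its descendants, while text()[1] returns only the element's own first text node.

```sql
-- Sketch: '.' vs 'text()[1]' when an element has child nodes
declare @demo xml = N'<field type="2">0.5<note>ignore me</note></field>'

select @demo.value('(/field)[1]',        'nvarchar(50)') as dot_value,  -- '0.5ignore me'
       @demo.value('(/field/text())[1]', 'nvarchar(50)') as text_value  -- '0.5'
```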

The query plan does not have the parent-axis issue if you use an XML index, but you will still benefit from changing how you fetch the field value.

Whirly answered 13/6, 2014 at 6:58 Comment(3)
Wish I could +5. Thanks for the detailed explanation. XML queries make a lot more sense now. - Intellectualism
Well done! I was using OPENROWSET and the .query() method with horrible performance results. Your solution cut the entire processing time down to less than a minute! - Fanfaron
+1 for advising to shred node by node (via cross apply), use text(), and including execution plan results. - Marillin

Adding an XML index did the trick. The 6,500 records that previously took 20 minutes to run now take less than 4 seconds.

create table #temp (id int primary key, x xml)
create primary xml index idx_x on #temp (x)

insert into #temp (id, x)
values (1, '
<data item_id_type="1" cfgid="{4F5BBD5E-72ED-4201-B741-F6C8CC89D8EB}" has_data_event="False">
  <item name="1">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.506543009706267</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.79500402346138</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.0152649050024924</field>
  </item>
  <item name="2">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.366096802804087</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.386642801354842</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.031671174184115</field>
  </item>
  <item name="3">
    <field id="{EA032B25-19F1-4C1B-BDDE-3113542D13A5}" type="2">0.883962036959074</field>
    <field id="{71014ACB-571B-4C72-9C9B-05458B11335F}" type="2">-0.781459993268713</field>
    <field id="{740C36E9-1988-413E-A1D5-B3E5B4405B45}" type="2">0.228442351572923</field>
  </item>
</data>
')

select c.value('(../@name)','varchar(5)') as item_name
      ,c.value('(@id)','uniqueidentifier') as field_id
      ,c.value('(@type)','int') as field_type
      ,c.value('(.)','nvarchar(15)') as field_value
from   #temp cross apply
       #temp.x.nodes('/data/item/field') as y(c)

drop table #temp
Intellectualism answered 13/6, 2014 at 4:46 Comment(5)
James, you may want to check your basics: table and index names in tempdb (i.e. temporary) are prefixed, so 1,000 users can have the same table with the same index and they will not clash. At least this is true for non-XML indices ;) - Waggery
@Waggery - You are right. I remember having a problem with this back in the 90's. Can't remember which DBMS. Glad to know it's not a problem anymore! - Intellectualism
@Waggery Your comment is oversimplified, as it is only true for single-hash-prefixed #tempTables. If the temp table is named ##somethingLikeThis, i.e. prefixed with two hashes, it most certainly is global. - Marillin
@JamesL. I created a primary XML index on the column and it increases the time. Any other way to improve the performance? - Rolypoly
@DharmendraKumar - Adding an index was enough to fix my performance issue, but it did not address the underlying issue: too many rows are being returned because of how I pulled the data. Look at Mikael's answer. Use the query plan to see how many rows each step returns. Come up with a different plan to extract the data. The structure of your XML will dictate how you need to extract the data. I recommend posting your own question with a real sample of your XML so you can get specific help with your particular data. - Intellectualism

© 2022 - 2024 — McMap. All rights reserved.