Structuring large SQL rowset(s) and consuming in .NET
Asked Answered
S

2

6

Take a look at this psuedo schema (please note this is a simplification so please try not to comment too heavily on the "advisability" of the schema itself). Assume Indexes are inplace on the FKs.

 TABLE Lookup (
     Lookup_ID int not null PK
     Name nvarchar(255) not null
 )

 TABLE Document (
     Document_ID int not null PK
     Previous_ID null FK REFERENCES Document(Document_ID)
 )

 TABLE Document_Lookup (
     Document_ID int not null FK REFERENCES Document(Document_ID)
     Lookup_ID int not null FK REFERENCES Lookup(Lookup_ID)
 )

Volumes: Document, 4 Million rows of which 90% have a null Previous_ID field value; Lookup, 6000 rows, Average lookups attached to each document 20 giving Document_Lookup 80 Millions rows.

Now in a .NET Service have structure to represent a Lookup row like this:-

 struct Lookup
 {
      public int ID;
      public string Name;
      public List<int> DocumentIDs;
 }

and that lookup rows are stored in a Dictionary<int, Lookup> where the key is the lookup ID. An important point here is that this dictionary should contain entries where the Lookup is referenced by at least one document, i.e., the list DocumentIDs should have Count > 0.

My task is populate this dictionary efficiently. So the simple approach would be:-

  SELECT dl.Lookup_ID, l.Name, dl.Document_ID
  FROM Document_Lookup dl
  INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID
  INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID
  WHERE d.Previous_ID IS NULL
  ORDER BY dl.Lookup_ID, dl.Document_ID

This could then be used to populate a the dictionary fairly efficiently.

The Question: Does the underlying rowset delivery (TDS?) perform some optimization? It seems to me that queries that de-normalise data are very common hence the possiblity that field values don't change from one row to the next is high, hence it would make sense to optomise the stream by not sending field values that haven't changed. Does anyone know whether such an optomisation is in place? (Optomisation does not appear to exist).

What more sophisticated query could I use to eliminate the duplication (I'm think specifically of repeating the name value)? I've heard of such a thing a "nested rowset", can that sort of thing be generated? Would it be more performant? How would I access it in .NET?

I would perform two queries; one to populate the Lookup dictionary then a second to populate the ditionary lists. I would then add code to knock out the unused Lookup entires. However imagine I got my predictions wrong and Lookup ended up being 1 Million rows with only a quarter actually referenced by any document?

Soakage answered 22/11, 2011 at 12:40 Comment(5)
Re the TDS part of the query; not AFAIK, but you could test it by pulling 2M of the same string, vs pulling 2M of different strings (same length). Just a thought.Hippocrene
@Marc: D'oh that would be a sensible test that I should've done.Soakage
I know this doesn't help answer the question, but why do you need to retrieve the entire dataset into your structure? Getting just a small set of records that's needed right now would be far more optimal than making getting the whole lot a little faster.Edgar
@webturner: "would be far more optimal" I know this not be true. I have a dedicate server doing some really intensive stuff. I know for a fact that once this dictionary is built the performance of what I'm doing is massively better than the equivalent using SQL. Thats not in question, I just want to ensure a timely startup time for the rare occasions the service needs to be restarted. The quesiton carefully constructed to provide a view of a specific issue, its not an overview of all that I'm doing.Soakage
@Marc: Have done the test that I should have done in first place it appears you are correct. There appears to be no such optomisation.Soakage
K
1
  • As long as the names are relatively short in practice, the optimisation may not be necessary.

  • The easiest optimisation is to split it into two queries, one to get the names, the other to get the Document_ID list. (can be in the other order if it makes it easier to populate your data structures).

Example:

/*First get the name of the Lookup*/
select distinct dl.Lookup_ID, l.Name
FROM Document_Lookup dl 
INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID 
INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID 
WHERE d.Previous_ID IS NULL 
ORDER BY dl.Lookup_ID, dl.Document_ID 

/*Now get the list of Document_IDs for each*/
SELECT dl.Lookup_ID, dl.Document_ID 
FROM Document_Lookup dl 
INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID 
INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID 
WHERE d.Previous_ID IS NULL 
ORDER BY dl.Lookup_ID, dl.Document_ID 
  • There are also various tricks you could use to massage these into a single table but I suggest these are not worthwile.

  • The heirarchical rowsets you are thinking of are the MSDASHAPE OLEDB provider. They can do what you are suggesting but would restrict you to using the OLEDB provider for SQL which may not be what you want.

  • Finally consider careful XML

For example:

select
  l.lookup_ID as "@l", 
  l.name as "@n",
  (
    select dl.Document_ID as "node()", ' ' as "node()" 
    from Document_Lookup dl where dl.lookup_ID = l.lookup_ID for xml path(''), type
  ) as "*"
  from Lookup l
  where l.lookup_ID in (select dl.lookup_ID from Document_Lookup dl)
  for xml path('dl')

returns:

<dl l="1" n="One">1 2 </dl>
<dl l="2" n="Two">2 </dl>
Kermis answered 24/1, 2012 at 13:54 Comment(0)
I
0

When you're asking about "nested rowsets" are you referring to using the DbDataReader.NextResult() method?

if your query has two "outputs" (two select statements which return a separate resultsets), you can loop through the first using DbDataReader.Next() and when that returns "false" then you can call DbDataReader.NextResult() and then use DbDataReader.Next() again to continue.

var reader = cmd.ExecuteReader();
  while(reader.Read()){
    // load data
  }

  if(reader.NextResult()){
    while(reader.Read()){
      // lookup record from first result

      // load data from second result
    }
  }

I've done this frequently to reduce duplicate data in a similar situation and it works really well:

SELECT * FROM tableA WHERE [condition]
SELECT * FROM tableB WHERE EXISTS (SELECT * FROM tableA WHERE [condition] AND tableB.FK = tableA.PK)

Disclaimer: I haven't tried this with a resultset as large as you're describing.

The downside of this is you'll need a way to map the second resultset to the first, using a hashtable or order list.

Incurrent answered 23/11, 2011 at 15:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.