Take a look at this psuedo schema (please note this is a simplification so please try not to comment too heavily on the "advisability" of the schema itself). Assume Indexes are inplace on the FKs.
TABLE Lookup (
Lookup_ID int not null PK
Name nvarchar(255) not null
)
TABLE Document (
Document_ID int not null PK
Previous_ID null FK REFERENCES Document(Document_ID)
)
TABLE Document_Lookup (
Document_ID int not null FK REFERENCES Document(Document_ID)
Lookup_ID int not null FK REFERENCES Lookup(Lookup_ID)
)
Volumes: Document, 4 Million rows of which 90% have a null Previous_ID field value; Lookup, 6000 rows, Average lookups attached to each document 20 giving Document_Lookup 80 Millions rows.
Now in a .NET Service have structure to represent a Lookup row like this:-
struct Lookup
{
public int ID;
public string Name;
public List<int> DocumentIDs;
}
and that lookup rows are stored in a Dictionary<int, Lookup>
where the key is the lookup ID. An important point here is that this dictionary should contain entries where the Lookup is referenced by at least one document, i.e., the list DocumentIDs
should have Count > 0.
My task is populate this dictionary efficiently. So the simple approach would be:-
SELECT dl.Lookup_ID, l.Name, dl.Document_ID
FROM Document_Lookup dl
INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID
INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID
WHERE d.Previous_ID IS NULL
ORDER BY dl.Lookup_ID, dl.Document_ID
This could then be used to populate a the dictionary fairly efficiently.
The Question: Does the underlying rowset delivery (TDS?) perform some optimization? It seems to me that queries that de-normalise data are very common hence the possiblity that field values don't change from one row to the next is high, hence it would make sense to optomise the stream by not sending field values that haven't changed. Does anyone know whether such an optomisation is in place? (Optomisation does not appear to exist).
What more sophisticated query could I use to eliminate the duplication (I'm think specifically of repeating the name value)? I've heard of such a thing a "nested rowset", can that sort of thing be generated? Would it be more performant? How would I access it in .NET?
I would perform two queries; one to populate the Lookup dictionary then a second to populate the ditionary lists. I would then add code to knock out the unused Lookup entires. However imagine I got my predictions wrong and Lookup ended up being 1 Million rows with only a quarter actually referenced by any document?