Query performance optimization for dynamically joined columns

Current situation in SQL Server database

There is a table Entry with the following columns:

  • EntryID (int)
  • EntryName (nvarchar)
  • EntrySize (int)
  • EntryDate (datetime)

In addition, it should be possible to store arbitrary metadata for an Entry. The names and values of this metadata should be freely definable, and it should be possible to add new metadata keys dynamically without changing the table structure of the database. Each metadata key can have one of the following data types:

  • Text
  • Numeric value
  • DateTime
  • Boolean value (True/False)

Thus there is a table DataKey to represent the metadata names and datatypes with the following columns:

  • DataKeyID (int)
  • DataKeyName (nvarchar)
  • DataKeyType (smallint) 0: Text; 1: Numeric; 2: DateTime; 3: Bit

The table DataValue stores a value for each combination of Entry and DataKey; which column holds the value depends on the data type of the metadata key. For each data type there is one nullable value column. This table has the following columns:

  • DataValueID (int)
  • EntryID (int) Foreign-Key
  • DataKeyID (int) Foreign-Key
  • TextValue (nvarchar) Nullable
  • NumericValue (float) Nullable
  • DateValue (datetime) Nullable
  • BoolValue (bit) Nullable

(Image of the database structure omitted; see the SQL Fiddle linked below.)
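For reference, the structure described above expressed as DDL (a sketch only; string lengths and key constraints are assumptions, the column names and types are as listed):

CREATE TABLE Entry (
    EntryID   int IDENTITY PRIMARY KEY,
    EntryName nvarchar(255),
    EntrySize int,
    EntryDate datetime
);

CREATE TABLE DataKey (
    DataKeyID   int IDENTITY PRIMARY KEY,
    DataKeyName nvarchar(255),
    DataKeyType smallint        -- 0: Text; 1: Numeric; 2: DateTime; 3: Bit
);

CREATE TABLE DataValue (
    DataValueID  int IDENTITY PRIMARY KEY,
    EntryID      int NOT NULL REFERENCES Entry(EntryID),
    DataKeyID    int NOT NULL REFERENCES DataKey(DataKeyID),
    TextValue    nvarchar(max) NULL,
    NumericValue float NULL,
    DateValue    datetime NULL,
    BoolValue    bit NULL
);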

TARGET

The goal is to retrieve a list of entries that fulfil conditions on the metadata keys, just as in a WHERE clause. For example:

Assumptions:

  • Metadata key KeyName1 is text
  • Metadata key KeyName2 is DateTime
  • Metadata key KeyName3 is numeric
  • Metadata key KeyName4 is Boolean

Query:

... WHERE (KeyName1 = 'Test12345' AND KeyName2 BETWEEN '01.09.2012 00:00:00' AND
'01.04.2013 23:59:00') OR (KeyName3 > 15.3 AND KeyName4 = True)

These queries must be executed very efficiently, even with a large amount of data such as:

  • Number of entries > 2,000,000
  • Number of data keys between 50 and 100, possibly > 100
  • Per entry, at least a subset of the keys has a value, possibly a value for every key (2,000,000 * 100 rows)

PROBLEM

The first problem arises when building the query. Normally a query needs a set whose columns can be referenced in the WHERE clause. Here, however, the "columns" used in the queries are themselves rows in table DataKey, so that metadata can be added dynamically without changing the database table structure. During research, a solution was found that builds a PIVOT at runtime, but it turned out to be very slow once there is a large set of data in the database.
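For reference, the runtime PIVOT that was tried might look roughly like the following sketch (not code from the question; simplified to the text values only, with the column list built from DataKey at runtime):

DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- build a comma-separated, quoted list of all metadata key names
SELECT @cols = STUFF((SELECT ',' + QUOTENAME(DataKeyName)
                      FROM DataKey
                      FOR XML PATH('')), 1, 1, '');

-- pivot the EAV rows into one column per key (text values only in this sketch)
SET @sql = N'SELECT EntryID, ' + @cols + N'
FROM (SELECT v.EntryID, k.DataKeyName, v.TextValue
      FROM DataValue v JOIN DataKey k ON k.DataKeyID = v.DataKeyID) src
PIVOT (MAX(TextValue) FOR DataKeyName IN (' + @cols + N')) p;';

EXEC sp_executesql @sql;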

QUESTIONS

  • Is there a more efficient way or structure to save the data for this purpose?
  • How can the requirements listed above be fulfilled, particularly with regard to performance and query execution time?

Here is a SQL Fiddle with the described database structure and some sample data: http://www.sqlfiddle.com/#!3/d1912/3

Rue answered 5/9, 2013 at 7:38 Comment(4)
This is a well written question, although you have bypassed the "not strongly typed" argument of the EAV vs. normalised form debate. There are literally thousands of articles/questions & answers/blogs comparing the two. – Deathful
I am not sure, and have never used them myself, but maybe using sparse columns (see technet.microsoft.com/en-us/library/cc280604.aspx) for the additional properties in the main table would be an option? You would then not need any joins in your queries, and could have many, many property columns. – Horsetail
Another option might be to put the properties into an XML column containing all the properties in one XML as elements or maybe even attributes of the root element. – Horsetail
At the cost of some additional disk space and query complexity, you could extract some of the EAV stuff into a normalised form (in a separate table), without sacrificing the flexibility you have. That's the sort of thing you can then update when you need to. – Hulett

One of the fundamental flaws in an Entity Attribute Value design (which is what you have here) is the difficulty of efficient and performant querying.

The more efficient structure for storing data is to abandon EAV and use a normalised relational form. But that will necessarily involve changing the structure of the database when the data structures change (which should be self evident).

You could abandon your TextValue/NumericValue/DateValue/BoolValue fields and replace them with a single sql_variant column, which would reduce your query complexity slightly, but the fundamental problem will remain.
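For illustration, such a table might look like the following sketch (the constraint details are assumptions, not part of the answer); note that sql_variant has its own limitations, e.g. it cannot hold nvarchar(max) values:

CREATE TABLE DataValue (
    DataValueID int IDENTITY PRIMARY KEY,
    EntryID     int NOT NULL REFERENCES Entry(EntryID),
    DataKeyID   int NOT NULL REFERENCES DataKey(DataKeyID),
    Value       sql_variant NULL   -- holds text, numeric, datetime or bit values
);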

As a side note, storing all numerics as floats will cause problems if you ever have to deal with money.

Camel answered 5/9, 2013 at 7:52 Comment(0)

I don't feel qualified to comment on what is best, or on design approaches; in fact I'm inclined not to answer at all. However, I have thought about your problem, and since you've taken the time to describe it clearly, this is how I would approach it.

I'd store each metadata data type in its own table. So:

 CREATE TABLE MetaData_Text (
     ID       int identity,
     EntryID  int,
     KeyName  nvarchar(50),
     KeyValue nvarchar(max)
 )

MetaData_DateTime, MetaData_Boolean and MetaData_Numeric have the same structure, but with the appropriate data type for the KeyValue column in each case.

The relationship between an Entry and each of these tables is 0-to-many, while every row in each of these tables belongs to exactly one Entry.

To add a new metadata item for an entry, I'd just use a stored procedure that takes the EntryID and key name, and has one optional parameter per possible metadata data type:

 create procedure AddMetaData @entryid int, @keyname varchar(50), @textvalue varchar(max) = null, @datevalue datetime = null, @boolvalue bit = null, @numvalue float = null
 as ...
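For illustration, the elided body might simply insert into whichever of the four tables matches the parameter that was supplied. A sketch (not part of the original answer), restating the signature from above:

 create procedure AddMetaData
     @entryid int, @keyname varchar(50),
     @textvalue varchar(max) = null, @datevalue datetime = null,
     @boolvalue bit = null, @numvalue float = null
 as
 begin
     -- insert into the table that matches the supplied value type
     if @textvalue is not null
         insert MetaData_Text (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @textvalue);
     else if @datevalue is not null
         insert MetaData_DateTime (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @datevalue);
     else if @numvalue is not null
         insert MetaData_Numeric (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @numvalue);
     else if @boolvalue is not null
         insert MetaData_Boolean (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @boolvalue);
 end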

For querying, I would define a set of functions, one per combination of (a) metadata data type and (b) test to be performed on that data type, for example:

 create function MetaData_HasDate_EQ(@entryid int, @keyname varchar(50), @val datetime)
 returns bit
 as begin
     declare @rv bit
     select @rv = case when exists(
       select 1 from MetaData_DateTime where EntryID = @entryid and KeyName = @keyname and KeyValue = @val) then 1 else 0 end;
     return @rv
 end

and incorporate function references into required query logic as per

 SELECT ...
 FROM entry e ...
 WHERE (dbo.MetaData_HasText_EQ(e.EntryID, 'KeyName1', 'Test12345') <> 0
     AND dbo.MetaData_HasDate_Btwn(e.EntryID, 'KeyName2', '01.09.2012 00:00:00', '01.04.2013 23:59:00') <> 0)
   OR (dbo.MetaData_HasNum_GT(e.EntryID, 'KeyName3', 15.3) <> 0 
     AND dbo.MetaData_HasBool_EQ(e.EntryID, 'KeyName4', 1) <> 0)
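The other helper functions used above (MetaData_HasText_EQ, MetaData_HasDate_Btwn, MetaData_HasNum_GT, MetaData_HasBool_EQ) would follow the same pattern as MetaData_HasDate_EQ; for example, a date-range test might look like this sketch:

 create function MetaData_HasDate_Btwn(@entryid int, @keyname varchar(50), @from datetime, @to datetime)
 returns bit
 as begin
     declare @rv bit
     select @rv = case when exists(
       select 1 from MetaData_DateTime where EntryID = @entryid and KeyName = @keyname
         and KeyValue between @from and @to) then 1 else 0 end;
     return @rv
 end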
Coltson answered 10/9, 2013 at 23:4 Comment(0)

I believe that performance issues with that kind of data structure may require the structure to be reworked.

However, I think this fairly simple dynamic sql allows you to query as desired, and appears to run reasonably fast in a quick test I did with over 100,000 rows in the Entry table and 500,000 in the DataValue table.

-- !! CHANGE WHERE CONDITION AS APPROPRIATE
--declare @where nvarchar(max)='where Key0=0'
declare @where nvarchar(max)='where Key1<550'

declare @sql nvarchar(max)='select * from Entry e';

-- append one OUTER APPLY per DataKey; each exposes that key's value as a column named after the key
select @sql=@sql
    +' outer apply (select '+DataKeyName+'='
    +case DataKeyType when 0 then 'TextValue' when 1 then 'NumericValue' when 2 then 'DateValue' when 3 then 'BoolValue' end
    +' from DataValue v where v.EntryID=e.EntryID and v.DataKeyID='+cast(DataKeyID as varchar)
    +') '+DataKeyName+' '
from DataKey;

set @sql+=@where;

exec(@sql);
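For illustration, with a text key Key0 (DataKeyID 1) and a numeric key Key1 (DataKeyID 2) (both names and IDs are assumed for this example), the generated statement would look roughly like:

select * from Entry e
 outer apply (select Key0=TextValue
              from DataValue v where v.EntryID=e.EntryID and v.DataKeyID=1) Key0
 outer apply (select Key1=NumericValue
              from DataValue v where v.EntryID=e.EntryID and v.DataKeyID=2) Key1
where Key1<550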
Marlenmarlena answered 12/9, 2013 at 13:48 Comment(4)
I had originally created a (generated) view based on flattening the DataValues first using MAX(), but it turns out the query optimizer handles it exactly like the OUTER APPLY, so I actually prefer your approach. Using a (generated) view rather than the dynamic SQL answers Rob's requirements better IMHO, so I took the liberty to edit your answer into one. – Favoritism
Strange but true: my edit was rejected for being 'too extensive and changing the original intention of the post'. Hmm, I don't really agree, but it seems I'll need to post my own answer then, sigh. – Favoritism
"Reasonably fast" is a vague concept. How fast did it run compared to a properly structured normalised database? – Camel
It ran in less than a second. – Marlenmarlena

You have not given any background information on how often the data is updated, how often new attributes are added, and so on...

Looking at your inputs, I think you could use a snapshot table which flattens your normalised data. It is not ideal, since columns need to be added manually, but it can be extremely fast. The snapshot could be refreshed at regular intervals, depending on your users' needs.
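For illustration only (the key names follow the question's example; the table and index names are made up), such a snapshot could be built like this:

-- Rebuild the flattened snapshot; one MAX(CASE ...) column is added by hand per metadata key.
IF OBJECT_ID('EntrySnapshot') IS NOT NULL DROP TABLE EntrySnapshot;

SELECT e.EntryID, e.EntryName, e.EntrySize, e.EntryDate,
       MAX(CASE WHEN k.DataKeyName = 'KeyName1' THEN v.TextValue    END) AS KeyName1,
       MAX(CASE WHEN k.DataKeyName = 'KeyName3' THEN v.NumericValue END) AS KeyName3
INTO EntrySnapshot
FROM Entry e
LEFT JOIN DataValue v ON v.EntryID   = e.EntryID
LEFT JOIN DataKey   k ON k.DataKeyID = v.DataKeyID
GROUP BY e.EntryID, e.EntryName, e.EntrySize, e.EntryDate;

-- Index the snapshot columns that are filtered on most often.
CREATE INDEX IX_EntrySnapshot_KeyName3 ON EntrySnapshot (KeyName3);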

Thecla answered 11/9, 2013 at 22:36 Comment(0)

First, to answer why people use EAV or KVP even though it is so inefficient query-wise: blogs and textbooks offer many plausible reasons, but in real life it is often to avoid dealing with an uncooperative DBA.

For a small organization with a small amount of data, it is OK to have a multi-use database (OLTP + DW), because the inefficiencies are not noticeable. When your database gets large, it's time to replicate your online data into a data warehouse. In addition, if the data is meant for analytics, it should be replicated further from your relational data warehouse into a dimensional model or a flat-and-wide structure for consumption.

These are the data models I would expect in a large organization:

  1. OLTP
  2. Relational Data Warehouse
  3. Dimensional Model for Reporting
  4. Datamarts for Analytics.

So, to answer your question: you shouldn't query against your KVP tables, and creating a view on top of them doesn't make it better. The data should be flattened out (i.e. pivoted) into a physical table. What you have is a hybrid of 1 and 2. If there will be no users for #3, just build #4.

Cagle answered 14/9, 2013 at 19:38 Comment(0)

Based on Dan Belandi's answer, I think the easiest way to use this would be a stored procedure/trigger that looks at the metadata table and builds a view on the data table accordingly.

Code would then look like this:

 -- drop old view
IF object_id('EntryView') IS NOT NULL DROP VIEW [EntryView]
GO
-- create view based on current meta-information in [DataKey]
DECLARE @crlf char(2)
DECLARE @sql nvarchar(max)

SELECT @crlf = char(13) + char(10)

SELECT @sql = 'CREATE VIEW [EntryView]' + @crlf
            + 'AS' + @crlf
            + 'SELECT *' + @crlf
            + '  FROM [Entry] e' 

SELECT @sql = @sql + @crlf      
            + ' OUTER APPLY (SELECT '+ QuoteName(DataKeyName) + ' = ' + QuoteName((CASE DataKeyType WHEN 0 THEN 'TextValue' 
                                                                                                    WHEN 1 THEN 'NumericValue' 
                                                                                                    WHEN 2 THEN 'DateValue' 
                                                                                                    WHEN 3 THEN 'BoolValue' 
                                                                                                           ELSE '<Unknown>' END)) + @crlf
            + '                FROM [DataValue] v WHERE v.[EntryID] = e.[EntryID] AND v.[DataKeyID] = ' + CAST(DataKeyID as varchar) + ') AS ' + QuoteName(DataKeyName)
 FROM DataKey

--PRINT @sql
EXEC (@sql)

-- Example usage:

SELECT * 
  FROM EntryView 
 WHERE (Key1 = 0  AND Key2 BETWEEN '01.09.2012 00:00:00' AND '01.04.2013 23:59:00') 
    OR (Key3 > 'Test15.3' AND Key4 LIKE '%1%')
Favoritism answered 14/9, 2013 at 20:41 Comment(0)

I would use 4 tables, one for each data type:

MDat1:

  • DataValueID (int)
  • EntryID (int) Foreign-Key
  • DataKeyID (int) Foreign-Key
  • TextValue (nvarchar) Nullable

MDat2:

  • DataValueID (int)
  • EntryID (int) Foreign-Key
  • DataKeyID (int) Foreign-Key
  • NumericValue (float) Nullable

MDat3:

  • DataValueID (int)
  • EntryID (int) Foreign-Key
  • DataKeyID (int) Foreign-Key
  • DateValue (datetime) Nullable

MDat4:

  • DataValueID (int)
  • EntryID (int) Foreign-Key
  • DataKeyID (int) Foreign-Key
  • BoolValue (bit) Nullable

If partitioning were available, I would use it on DataKeyID for all 4 tables. Then I would use 4 views:

SELECT ... FROM Entry JOIN MDat1 ON ... EnMDat1
SELECT ... FROM Entry JOIN MDat2 ON ... EnMDat2
SELECT ... FROM Entry JOIN MDat3 ON ... EnMDat3
SELECT ... FROM Entry JOIN MDat4 ON ... EnMDat4
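For illustration, the first of these views might be defined like this (a sketch; the exact column list is an assumption):

CREATE VIEW EnMDat1 AS
SELECT e.EntryID, e.EntryName, e.EntrySize, e.EntryDate,
       m.DataKeyID, m.TextValue
FROM Entry e
JOIN MDat1 m ON m.EntryID = e.EntryID;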

So this example:

WHERE (KeyName1 = 'Test12345' AND KeyName2 BETWEEN '01.09.2012 00:00:00' AND
'01.04.2013 23:59:00') OR (KeyName3 > 15.3 AND KeyName4 = True)

Goes like:

...EnMDat1 JOIN EnMDat3 ON ... AND EnMDat1.TextValue ='Test12345' AND EnMDat3.DateValue BETWEEN '01.09.2012 00:00:00' AND
'01.04.2013 23:59:00')
...
UNION ALL 
...
EnMDat2 JOIN EnMDat4 ON ... AND EnMDat2.NumericValue > 15.3 AND EnMDat4.BoolValue = True

This will work faster than one metadata table. However, you will need an engine to build the queries if you have many different WHERE-clause scenarios. You can also omit the views and write the statement from scratch each time.

Ecker answered 30/11, 2017 at 16:14 Comment(0)
