So Azure SQL Data Warehouse doesn't support identity columns, and therefore it's tricky to deal with surrogate keys... anyone got any bold solutions on this one?
This is the best I have found, and it's pretty horrific.
That is the best option - but you can use a constant value in your OVER clause to avoid having to sort on a particular value, and you don't need to use a variable.
INSERT INTO testTgtTable (SrgKey, colA, colB)
SELECT
      -- ROW_NUMBER over a constant avoids sorting on a real column;
      -- the MAX(SrgKey) subquery offsets the new keys past the existing ones
      ROW_NUMBER() OVER (ORDER BY (SELECT 1))
        + (SELECT ISNULL(MAX(SrgKey), 0) FROM dbo.testTgtTable) AS SK
    , [colA]
    , [colB]
FROM testSrcTable;
Sometimes a row number exists on the file, or can easily be added. If it is present, it can be leveraged to generate the surrogate key values. It is a multi-step process.
The code looks something like this:
-- Capture the current highest surrogate key in the fact table
DECLARE @max_count BIGINT;
SET @max_count = (SELECT MAX(ID) FROM Fact);
...
-- CTAS the staged rows, offsetting the file's row number by that maximum
-- (every column in a CTAS needs a name, hence the alias)
CREATE TABLE Input_Load
WITH (DISTRIBUTION = ROUND_ROBIN
     ,CLUSTERED COLUMNSTORE INDEX
     )
AS
SELECT @max_count + RowNumber AS ID
     , ...
FROM dbo.stage_table
;
I do not think surrogate keys based on the hash value of the business key are a good solution, because of the very issue you lay out regarding collisions. It defeats the purpose of the surrogate key, which is to provide a unique ID for the DW regardless of the BK. All of the classic issues with "Intelligent" or "Smart" keys, or those related to using the BK as a PK, would still exist.
We now have the Identity column feature in Azure SQL Data Warehouse. Link
The Identity column feature is not compatible with CTAS statements, which greatly reduces its usefulness as a "solution". It only works with INSERT, UPDATE and DELETE statements, which do not perform well in ASDW.
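For reference, a minimal sketch of what the Identity column route looks like (table and column names here are illustrative, not from the post): IDENTITY is declared in CREATE TABLE and populated via INSERT...SELECT; you cannot CTAS into a table that has an IDENTITY column, and the IDENTITY column cannot be the distribution column.
-- Illustrative dimension table; names are hypothetical
CREATE TABLE dbo.DimCustomer
(
    CustomerSK   INT IDENTITY(1,1) NOT NULL,  -- surrogate key; values are unique but not guaranteed sequential
    CustomerBK   NVARCHAR(50)      NOT NULL,  -- natural/business key
    CustomerName NVARCHAR(200)     NULL
)
WITH (DISTRIBUTION = HASH(CustomerBK), CLUSTERED COLUMNSTORE INDEX);

-- Loading has to go through INSERT...SELECT rather than CTAS
INSERT INTO dbo.DimCustomer (CustomerBK, CustomerName)
SELECT CustomerBK, CustomerName
FROM dbo.stage_Customer;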
Hash-based surrogate keys make sense as a replacement for sequence-based surrogate keys with the move from SMP to MPP data warehouses and with the introduction of Hadoop, NoSQL and other Big Data extensions to your BI ecosystem.
Here are a few reasons why you might want to consider hash-based surrogate keys over sequence-based ones:
A consistent surrogate key generation approach across the various platforms in your BI ecosystem. Consistent hash-based keys can be generated independently in each environment, whether that is an ETL tool (SSIS, DataStage, etc.), a NoSQL or MPP database, or a Hadoop implementation.
Hash-based surrogate keys make more sense than sequence-based ones in an ELT implementation, in contrast to ETL. "Load the data and then process it" (ELT) is the preferred approach in MPP and Big Data solutions. Data loading and transformation processes are simplified by replacing lookups with hash value calculation. As a consequence, the work shifts from I/O-intensive operations (lookups) to CPU-intensive ones (hash generation).
Quite often all data loading/transformation processes can be executed completely in parallel, because dependencies between tables can be avoided: hash-based surrogate keys are consistent and can be generated independently.
Quite often in real-time/near real-time data update scenarios, hash-based surrogate keys can be generated on the fly, eliminating the need for additional lookups, which allows you to skip staging areas and insert directly into fact tables.
Consistent surrogate keys across Development, UAT and Production environments.
Joins on fixed-length hash keys are fairly optimal on most MPP data warehouse platforms.
Here are a few suggestions:
Use the natural business key as the input to the hash function for the Primary key in dimension tables.
Use the concatenated natural business keys that make up the Primary key as the input to the hash function for fact tables. Don't forget to separate the business keys in the concatenation with a specific character, e.g. |, to avoid accidental collisions (see the sketch after this list).
Use the natural business key as the input to the hash function for links to dimensions in fact tables.
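As an illustration only (the stage_Sales table and its columns are hypothetical, not from the question), a T-SQL sketch of generating hash-based keys with HASHBYTES, concatenating the business keys with a | separator:
-- Hypothetical example: derive hash-based surrogate keys during the load.
-- SHA2_256 gives a fixed-length 32-byte key; the | separator keeps
-- ('AB','C') and ('A','BC') from hashing to the same value.
SELECT
      HASHBYTES('SHA2_256', CustomerBK) AS CustomerSK
    , HASHBYTES('SHA2_256', ProductBK)  AS ProductSK
    , HASHBYTES('SHA2_256',
          CONCAT(CustomerBK, '|', ProductBK, '|',
                 CONVERT(VARCHAR(10), OrderDate, 112))) AS SalesSK
    , SalesAmount
FROM dbo.stage_Sales;
The same expressions can be reproduced in any tool or platform that offers SHA-256, which is what keeps the keys consistent across environments.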
However, as usual, a word of warning! Hash-based surrogate keys can potentially create collisions, i.e. the same hash value could be generated from two distinct input values. You can read more about this here and here.
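For a rough sense of the risk (a standard birthday-bound estimate, assuming a uniformly distributed k-bit hash): the probability of any collision among n distinct inputs is approximately n^2 / 2^(k+1). For a 32-bit hash that reaches roughly 50% at around 77,000 keys, whereas for SHA-256 it remains negligible even with billions of keys.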