So Azure SQL Data Warehouse doesn't support identity columns, and therefore it's tricky to deal with surrogate keys... anyone got any bold solutions on this one?
This is the best I have found, and it's pretty horrific.
That is the best option - but you can use a constant value in your OVER clause to avoid having to sort on a particular value, and you don't need to use a variable.
INSERT INTO testTgtTable (SrgKey, colA, colB)
SELECT
      -- ROW_NUMBER over a constant avoids sorting on a real column;
      -- the MAX(SrgKey) subquery offsets the new keys past the existing ones
      ROW_NUMBER() OVER (ORDER BY (SELECT 1))
        + (SELECT ISNULL(MAX(SrgKey), 0) FROM dbo.testTgtTable) AS SK
    , [colA]
    , [colB]
FROM testSrcTable;
Sometimes a row number exists on the file, or can easily be added. If it is present, it can be leveraged to generate the surrogate key values. It is a multi-step process.
The code looks something like this:
-- Capture the current highest surrogate key in the fact table
DECLARE @max_count BIGINT;
SET @max_count = (SELECT MAX(ID) FROM Fact);
...
-- CTAS the staged rows, offsetting the file's row number by that maximum
-- (every column in a CTAS needs a name, hence the alias)
CREATE TABLE Input_Load
WITH (DISTRIBUTION = ROUND_ROBIN
     ,CLUSTERED COLUMNSTORE INDEX
     )
AS
SELECT @max_count + RowNumber AS ID
     , ...
FROM dbo.stage_table
;
I do not think surrogate keys based on the hash value of the business key are a good solution, because of the very issue you lay out regarding collisions. It defeats the purpose of the surrogate key, which is to provide a unique ID for the DW regardless of the BK. All of the classic issues with "Intelligent" or "Smart" keys, or those related to using the BK as a PK, would still exist.
We now have the Identity column feature in Azure SQL Data Warehouse. Link
The Identity column feature is not compatible with CTAS statements, which greatly reduces its usefulness as a "solution". It only works with INSERT, UPDATE and DELETE statements, which do not perform well in ASDW.
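For reference, a minimal sketch of what the Identity column route looks like (table and column names here are illustrative, not from the post): IDENTITY is declared in CREATE TABLE and populated via INSERT...SELECT; you cannot CTAS into a table that has an IDENTITY column, and the IDENTITY column cannot be the distribution column.
-- Illustrative dimension table; names are hypothetical
CREATE TABLE dbo.DimCustomer
(
    CustomerSK   INT IDENTITY(1,1) NOT NULL,  -- surrogate key; values are unique but not guaranteed sequential
    CustomerBK   NVARCHAR(50)      NOT NULL,  -- natural/business key
    CustomerName NVARCHAR(200)     NULL
)
WITH (DISTRIBUTION = HASH(CustomerBK), CLUSTERED COLUMNSTORE INDEX);

-- Loading has to go through INSERT...SELECT rather than CTAS
INSERT INTO dbo.DimCustomer (CustomerBK, CustomerName)
SELECT CustomerBK, CustomerName
FROM dbo.stage_Customer;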
Hash-based surrogate keys make sense as a replacement for sequence-based surrogate keys with the move from SMP to MPP data warehouses and with the introduction of Hadoop, NoSQL and other Big Data extensions to your BI ecosystem.
Here are a few reasons why you might want to consider hash-based surrogate keys over sequence-based ones:
A consistent surrogate key generation approach across the various platforms in your BI ecosystem. Consistent hash-based keys can be generated independently in each environment, whether that is an ETL tool (SSIS, DataStage, etc.), a NoSQL or MPP database, or a Hadoop implementation.
Hash-based surrogate keys make more sense than sequence-based ones in an ELT implementation, in contrast to ETL. "Load the data and then process it" (ELT) is the preferred approach in MPP and Big Data solutions. Data loading and transformation processes are simplified by replacing lookups with hash value calculation. As a consequence, the work shifts from I/O-intensive operations (lookups) to CPU-intensive ones (hash generation).
Quite often all data loading/transformation processes can be executed completely in parallel, because dependencies between tables can be avoided: hash-based surrogate keys are consistent and can be generated independently.
Quite often in real-time/near real-time data update scenarios, hash-based surrogate keys can be generated on the fly, eliminating the need for additional lookups, which allows you to skip staging areas and insert directly into fact tables.
Consistent surrogate keys across Development, UAT and Production environments.
Joins on fixed-length hash keys are fairly optimal on most MPP data warehouse platforms.
Here are a few suggestions:
Use the natural business key as the input to the hash function for the Primary key in dimension tables.
Use the concatenated natural business keys that make up the Primary key as the input to the hash function for fact tables. Don't forget to separate the business keys in the concatenation with a specific character, e.g. |, to avoid accidental collisions (see the sketch after this list).
Use the natural business key as the input to the hash function for links to dimensions in fact tables.
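As an illustration only (the stage_Sales table and its columns are hypothetical, not from the question), a T-SQL sketch of generating hash-based keys with HASHBYTES, concatenating the business keys with a | separator:
-- Hypothetical example: derive hash-based surrogate keys during the load.
-- SHA2_256 gives a fixed-length 32-byte key; the | separator keeps
-- ('AB','C') and ('A','BC') from hashing to the same value.
SELECT
      HASHBYTES('SHA2_256', CustomerBK) AS CustomerSK
    , HASHBYTES('SHA2_256', ProductBK)  AS ProductSK
    , HASHBYTES('SHA2_256',
          CONCAT(CustomerBK, '|', ProductBK, '|',
                 CONVERT(VARCHAR(10), OrderDate, 112))) AS SalesSK
    , SalesAmount
FROM dbo.stage_Sales;
The same expressions can be reproduced in any tool or platform that offers SHA-256, which is what keeps the keys consistent across environments.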
However, as usual, a word of warning! Hash-based surrogate keys can potentially create collisions, i.e. the same hash value could be generated from two distinct input values. You can read more about this here and here.
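For a rough sense of the risk (a standard birthday-bound estimate, assuming a uniformly distributed k-bit hash): the probability of any collision among n distinct inputs is approximately n^2 / 2^(k+1). For a 32-bit hash that reaches roughly 50% at around 77,000 keys, whereas for SHA-256 it remains negligible even with billions of keys.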