Surrogate Keys in Datawarehouse
Asked Answered
D

4

5

I want to understand how surrogate keys are leveraged in real-time DWH environments. I get that they add the benefit of not being dependent on source-generated data to store each dimension key and also avoid having composite key built out of natural keys from dimensions in the fact, For eg, (prod id + cust id+ time id)

But does it not add the complexity of having to maintain the lookup of (natural key, surrogate key) while we load data into facts. I have been working in BI/DW teams for last 3 years and we do not maintain any surrogate keys in our systems. We leverage natural keys to build our datamarts. One sample usecase is revenue data which is stored in transactional system, which is loaded into warehouse at customer, product, time period granularity using the same natural keys from source. We use the same to join with corresponding dimensions to build STAR schema.

Main reason I think it makes sense in our case is that business uses EDW data to do micro-analysis of data at account level, not just trending analysis. We would need to maintain data integrity in that case which we achieve using natural keys. I want to understand how other DW environments work. How do you leverage surrogate keys or natural keys in your systems.

Thanks!

Diandiana answered 29/4, 2017 at 13:36 Comment(0)
A
5

One reason is to maintain and being able to compare historical changes.

Example, if one of your product attributes changes and you wanted to look at and compare revenue before and after the attribute change, how would you do that without using surrogate product keys? Using a natural key would just overwrite the old value when you ETL.

The lookup doesn't have to be very complex to maintain. Most ETL tools have support for this and usually have some caching mechanism built in to cache lookup values.

Also, what do you mean when you say "real-time" data warehouse? Are you using ROLAP, DirectQuery or something similar? If so, you might be building your marts directly on your OLTP system and de-normalize in some semantic model. Then you could use your natural keys because there is no traditional ETL/data warehouse to do lookups and store your surrogate keys.

Lastly, granularity is not related to what type of key you are using.

Annulet answered 29/4, 2017 at 18:34 Comment(0)
N
3

An additional reason, which has not been mentioned yet, is performance. Sometimes (very often in my experience) natural keys are strings, sometimes long strings.

It seems not a big deal using 10, 20 or 30 byte string instead of a 4 byte integer, but when you have 10 dimension and hundred of millions of rows, it adds up fast.

Neil answered 3/5, 2017 at 13:43 Comment(1)
Good input...I do practice surrogate keys whenever there is performance concern. Thanks!Diandiana
N
2

If your business is stable and runs on top of a single application for everything, natural keys will work just fine, as your experience tells you.

Most businesses are not in such a state or not for very long. Mergers happen, new applications are introduced, legacy stuff refuses to die. New lines of business are started or split off and require wholesale renaming of existing natural key schemes.

Surrogate keys provide great benefits in keeping reporting dimensions stable and usable across the business when you have a bunch of separate new and legacy applications that all have their own versions of your customers and products and regularly get migrated or swapped out for similar systems with new natural key definitions. The major work is linking the various natural keys of a customer/product/whatever, assigning a surrogate key is just a simple and very helpful step in that.

Even in your scenario, I would use surrogate keys as they prepare you for future changes and are very helpful with historical data (as NITHIN B also answered) in Type 2 Dimensions.

It's quite possible to do versioning with natural keys by adding a version field to your dimension and fact tables, but it makes the joins harder to write for reporting and your whole system still gets messy if business or application changes cause the natural keys to change.

To illustrate:

Select bla from Fact F inner join Dim_Customer DC on F.Surrogate_key = DC.Surrogate_Key

is almost foolproof. If you mess this up, it will be immediately obvious in your report.

Select bla from Fact F inner join Dim_Customer DC on F.Natural_key = DC.Natural_Key and F.Version = DC.Version

does the same job, but if you forget that last line, everything will look normal but your numbers will be inflated depending on how many versions there are on average. Kinda painful when that 25% sales increase turns out to be an error.

Naxos answered 1/5, 2017 at 12:42 Comment(2)
I have not come across situations of data issues that come up with mergers etc in my experience but we do have orders dataset which go through revisions where we have an id generated from source for each revision and that revision id(which we call order id) to identify latest version. This is essentially surrogate key which we have coming from source. Thanks!Diandiana
In some cases it's actually easier to handle integrations using natural keys as well since that's usually the criteria. Duplicated natural keys has been a very common outcome from pure surrogate key usage IME. Ultimately, for accuracy, you have to get down to a natural key in most cases - I've always found it a lot easier and more robust but the programming is harder.Grisette
M
1

Could you please post a sample design.

I would be interested to see how you can load a fact table with Dimension Keys which are natural keys. Kimball design never recommends it.

My stand on Surrogate Keys in DWH.

  1. Surrogate keys give you a lot of flexibility with Type 2 Dimensions, ie if you have Type 2 Dimensions. For eg: You can track changes of a customer if he or she changes her second name. You can have rows withe old values and new values.
  2. Fact tables usually hold keys which are surrogate keys. It makes your star schema neat and tidy and robust.

However I am not jumping queues here, would wait for your design before going pro or against your stand.

Cheers Nithin

Morehead answered 30/4, 2017 at 2:40 Comment(2)
Our design is sample. Take any classic sales datamart to measure sales volume, amount, time-series metrics etc. Out fact table contains composite , keys made of natural keys of (cust_id, product id, time period id) that joins with each dimension on corresponding natural key. So far, most of our dim tables are SCD Type 1 and for the ones that needed to be type 2, we have surrogate keys flowing in from sources.Diandiana
I agree with your part 1 that surrogate keys are clean for SCD Type 2 but do not quite get how surrogate keys make things different if all the dimensions I use are SCD Type 1 as I tend to have single record for a cutomer in cust table at any point of time.Diandiana

© 2022 - 2024 — McMap. All rights reserved.