Data Warehouse modelling: Data Vault vs Persistent Staging Area
Asked Answered
D

2

9

Consider the following two DWH architectures:

DWH with Raw Data Vault, layers:

  • Source systems
  • Staging area (truncated on every load, exact schema of source tables)
  • Raw Data Vault (modelled as Data Vault, contains record history, hubs/sats/links modelled after source systems structure, NO business rules applied)
  • Data Marts (dimensional models, business rules applied)

DWH with Persistent Staging Area (called PSA or HDA), layers:

  • Source systems
  • Staging area (truncated on every load, exact schema of source tables)
  • PSA (contains record history, schema of source tables + date_load/date_load_end columns etc.)
  • Data Marts (dimensional models, business rules applied)

Does the raw Data Vault concept have any benefits compared to the PSA concept? In my opinion the Data Vault modelling adds unnecessary complexity in terms of ETL, and is also slower performance-wise.

It's hard to find a real good answer on this, any thoughts?

Thanks!

Decca answered 20/1, 2020 at 20:28 Comment(2)
This is going to get closed as opinion-based. But note that a Persistent Staging Area is now more commonly called a "Data Lake", which should indicate the popularity of the approach :)Bookshelf
For me it depends a lot on your source systems. How many are they? How's the quality of their data models and so on? My experience is that raw vault can be pain if the source system data model integrity is of poor qualityHuda
C
1

Data Vault vs. Persistent Staging Area sounds to me like apples and pears - hard to compare. You should not try to define a Data Vault to capture source data without knowing the business ontology - otherwise you're building a source system vault, which offers no or little benefit to the business. Building a Data Vault on a PSA or a data lake makes much more sense to me. Landing the data as an image of the source systems and then step by step building a sustainable data collection out of it.

Cubital answered 4/3, 2020 at 21:14 Comment(0)
A
1

The complexity that is added corresponds to the relational model that is introduced earlier in the Data Vault case. I guess it depends on what level you want to model your data and make it reusable across different use-cases resulting in different data marts. What I mean is that the data marts are designed for a specific business cases and the data vault model is more designed to be overarching (enterprise model). Hence, the data marts based on DV model have no need to physically materialise any data at all. A layer of views can be set up which look like star schema tables, but which in fact have:

•   Zero maintenance cost.
•   Zero storage costs.
•   High flexibility.

Additionally, it is definitely nice to know how the data is related in a more general sense (organization wide) - if that information and the mentioned advantages are justifying the extra effort to build a DV model is difficult to judge.

Ashia answered 31/3, 2020 at 13:8 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.