Is Star Schema (data modelling) still relevant with the Lake House pattern using Databricks?

The more I read about the Lakehouse architectural pattern and follow the demos from Databricks, the less discussion I see around dimensional modelling as in a traditional data warehouse (the Kimball approach). I understand that compute and storage are much cheaper, but is there any larger impact on query performance without the data modelling? From Spark 3.0 onwards I see features like Adaptive Query Execution, Dynamic Partition Pruning, etc., but do they make dimensional modelling obsolete? If anyone has implemented dimensional modelling with Databricks, please share your thoughts.

Fretwell answered 15/11, 2021 at 22:40 Comment(1)
It may be nice to react to those seeking to enlighten you. – Thunderclap

Kimball's star schema and Data Vault modelling techniques are still relevant for the Lakehouse pattern, and the optimizations mentioned, like Adaptive Query Execution and Dynamic Partition Pruning, combined with data skipping, Z-Ordering, Bloom filters, etc., make queries very efficient.
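As an illustration, here is a minimal PySpark sketch of how these optimizations might be applied in a Databricks notebook. The table and column names (fact_sales, date_key, customer_id) are hypothetical, and AQE and Dynamic Partition Pruning are already on by default in recent runtimes; the explicit settings are shown only for clarity.

    # Hedged sketch for a Databricks notebook, where `spark` is the predefined SparkSession.
    # Adaptive Query Execution and Dynamic Partition Pruning (enabled by default in Spark 3.x):
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

    # Data skipping via Z-Ordering: cluster the (hypothetical) fact table on the keys most
    # often used in joins and filters, so file-level min/max statistics prune files at query time.
    spark.sql("OPTIMIZE fact_sales ZORDER BY (date_key, customer_id)")

    # Optional Bloom filter index on a high-cardinality column used in selective equality filters.
    spark.sql("""
        CREATE BLOOMFILTER INDEX ON TABLE fact_sales
        FOR COLUMNS (customer_id OPTIONS (fpp = 0.1, numItems = 50000000))
    """)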

In fact, Databricks' data warehousing specialists have recently published two related blog posts on this topic.

Gravelblind answered 16/8, 2022 at 7:21 Comment(0)

Not really a question for here, but interesting.

Of course Databricks et al. are selling their cloud solutions - I'm fine with that.

Taking this video https://go.incorta.com/recording-death-of-the-star-schema into account - whether paid for or the real opinion of Imhoff:

  • The computing power is higher at lower cost - if you manage it, you can do more things on the fly.
  • That said, the same could be said of SAP HANA, where you do ETL on the fly. I am not sure why I would always want a type 2 dimension to be created virtually rather than maintained as a table (see the sketch after this list).
  • Star schemas require thought and maintenance, but they show focus. Performance is less of an issue.
  • It is true that ad hoc queries across multiple fact tables do not work well with star schemas. Try it.
  • Databricks has issues with sharing clusters when using Scala; if you do it their way with PySpark, it is OK.
  • It remains to be seen whether querying Delta Lake via Tableau works well - I need to see it for myself. In the past we had the Thrift server etc. for this and it did not work, but things are different now.
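On the type 2 dimension point above, a minimal sketch of maintaining one as a physical Delta table (rather than deriving it on the fly) might look as follows. The table names dim_customer and staged_customer_updates, the tracked column address, and the schema (no surrogate key, for brevity) are hypothetical placeholders, not anything from Databricks' documentation.

    # Hedged sketch: SCD type 2 on a Delta table via MERGE plus append.
    # All table and column names are hypothetical; `spark` is the Databricks session.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    dim = DeltaTable.forName(spark, "dim_customer")
    updates = spark.table("staged_customer_updates")          # latest source extract
    current = dim.toDF().where("is_current = true")

    # Rows that are new, or whose tracked attribute changed versus the current dim row.
    changed = (updates.alias("u")
        .join(current.alias("d"), F.col("u.customer_id") == F.col("d.customer_id"), "left")
        .where("d.customer_id IS NULL OR d.address <> u.address")
        .select("u.*"))

    # 1) Close out the superseded current rows.
    (dim.alias("d")
        .merge(changed.alias("c"), "d.customer_id = c.customer_id AND d.is_current = true")
        .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
        .execute())

    # 2) Append the new versions as the current rows.
    (changed
        .withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
        .write.format("delta").mode("append").saveAsTable("dim_customer"))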

Where I am now, we have a data lake on HDP using the Delta format - and a dimensional SQL Server DWH, the latter due to the on-premises aspects of HDP.

Not having star schemas means people need more skills to query.

If it were only ad hoc querying, I would go for the Lakehouse, but in practice I think you need both. It's akin to the discussion of whether you need ETL tools if you have Spark.

Thunderclap answered 17/11, 2021 at 20:45 Comment(0)

In our use case we access the lakehouse using Power BI + Spark SQL. Being able to significantly reduce the data volume the queries return by using a star schema makes the experience faster for the end user and saves compute resources.
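For context, the kind of Spark SQL the BI layer ends up issuing against such a star schema might look roughly like the sketch below; the table and column names are made up. Filtering on the small dimension tables lets dynamic partition pruning and data skipping restrict the scan of the large fact table, so only a small aggregated result travels back to Power BI.

    # Hedged sketch of a star-schema query pushed down via Spark SQL from a BI tool.
    # fact_sales, dim_date and dim_store are hypothetical tables.
    monthly_sales = spark.sql("""
        SELECT d.year_month, s.region, SUM(f.sales_amount) AS sales
        FROM   fact_sales f
        JOIN   dim_date   d ON f.date_key  = d.date_key
        JOIN   dim_store  s ON f.store_key = s.store_key
        WHERE  d.year = 2021 AND s.region = 'EMEA'
        GROUP  BY d.year_month, s.region
    """)
    monthly_sales.show()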

However, considering things like the columnar nature of Parquet files and partition pruning, which both also decrease the data volume per query, I can imagine scenarios in which a reasonable setup without a star schema could work.
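A rough sketch of that alternative - one wide, denormalised Delta table that leans on partition pruning and Parquet's columnar layout instead of a star schema - might look like this, again with hypothetical names:

    # Hedged sketch: querying a wide flat table, relying on partition and column pruning.
    from pyspark.sql import functions as F

    result = (spark.table("sales_flat")                      # wide, denormalised Delta table
        .where("order_date >= '2021-11-01'")                 # partition pruning on order_date
        .select("order_date", "region", "sales_amount")      # column pruning (Parquet is columnar)
        .groupBy("order_date", "region")
        .agg(F.sum("sales_amount").alias("sales")))
    result.show()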

Beals answered 16/11, 2021 at 19:50 Comment(0)
