Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on top of flat OLTP tables (which are not in 3NF).

Some people think that a dimensional model is not required because most of the data needed for the reports is already present in a single table. But that table contains far more than we need, around 300 columns. Should I still split the flat table into dimensions and facts, or just use the flat tables directly in the reports?

Polariscope answered 10/5, 2018 at 21:50 Comment(0)

When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:

  • It is easier to query
  • It avoids JOINs that can be confusing and error-prone for casual users
  • Queries run faster (especially for Data Warehouse systems that use columnar data storage)

This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.

Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.

A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figure out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
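A minimal sketch of that idea in SQL (the table and column names here are invented for illustration):

    -- Flag precomputed by a daily batch job, so reporting queries can
    -- filter without joining to an activity table.
    SELECT customer_id,
           SUM(order_total) AS total_spend
    FROM   report_customers
    WHERE  customer_is_active
    GROUP  BY customer_id;

    -- Without the flag, the same question would need a join, e.g.
    -- JOIN customer_activity a ON a.customer_id = c.customer_id
    -- WHERE a.last_used_at >= DATEADD(day, -30, CURRENT_DATE)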

Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

Barcroft answered 11/5, 2018 at 2:15 Comment(2)
@HardyWest My answer is not specific to Amazon Redshift. Humans find it easier to query off wide, flat tables. These can be generated from more complex data models. That way, the source data is stored in a format appropriate for the data and "easy to query" tables can then be generated for the humans. Best of both worlds! - Barcroft
OK, thanks. Hopefully you can give an answer specific to AWS Redshift, as our company is thinking of migrating to it. I know the Microsoft SSAS cube prefers dim/fact and they may disagree, but that's just Microsoft; Hadoop also prefers big wide tables. Sometimes providing an answer specific to the AWS technology removes all the debate from the comments :) - Imprecise

You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with. If you want answers that you'll actually be able to use, I'd suggest being more specific.

The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.

Firstly you need to understand how Redshift differs from regular relational databases:

1) It is a Massively Parallel Processing (MPP) system, consisting of one or more nodes that the data is distributed across; each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important: the aim is usually to have the data distributed fairly evenly so that each node does roughly equal amounts of work for each query.

2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the need for dimension tables, because storing repeating data (attributes) directly in the main table is relatively efficient: columnar storage compresses repeated values very well.

Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
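In Redshift DDL these three options correspond to DISTSTYLE KEY, EVEN, and ALL; a rough sketch, with made-up table names:

    -- Distribute rows by the values of one column (the distribution key)
    CREATE TABLE page_views (
        user_id   BIGINT,
        viewed_at TIMESTAMP,
        url       VARCHAR(2048)
    )
    DISTSTYLE KEY
    DISTKEY (user_id);

    -- Spread rows evenly across the nodes in round-robin fashion
    CREATE TABLE raw_events (
        event_id BIGINT,
        payload  VARCHAR(65535)
    )
    DISTSTYLE EVEN;

    -- Keep a full copy of the table on every node (small tables only)
    CREATE TABLE countries (
        country_code CHAR(2),
        country_name VARCHAR(100)
    )
    DISTSTYLE ALL;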

So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that are regularly updated, it will be better to put those in a separate, smaller table rather than update one large table. However, if the data is largely append-only (unchanging) then there's no benefit in creating dimensions: queries grouping and aggregating the data will be efficient over a single table.
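For example, a typical reporting query over a single wide table (hypothetical names) needs no joins, and the columnar storage means only the referenced columns are read:

    SELECT country,
           DATE_TRUNC('month', order_date) AS month,
           SUM(order_total) AS revenue
    FROM   sales_flat
    GROUP  BY country, DATE_TRUNC('month', order_date);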

JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id); if they aren't, Redshift will have to physically copy data around the nodes to be able to run the query. So if you do have to have dimensions, you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), and any other dimensions may need to be distributed as ALL (copied to every node).
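A sketch of what that could look like, using invented table names:

    -- Fact table and its largest dimension share the same DISTKEY, so
    -- joins on user_id can be resolved locally on each node.
    CREATE TABLE fact_orders (
        user_id     BIGINT,
        order_date  DATE,
        order_total DECIMAL(12,2)
    )
    DISTKEY (user_id)
    SORTKEY (order_date);

    CREATE TABLE dim_users (
        user_id   BIGINT,
        user_name VARCHAR(200),
        country   VARCHAR(100)
    )
    DISTKEY (user_id);

    -- Any remaining small dimensions can be copied to every node.
    CREATE TABLE dim_products (
        product_id   BIGINT,
        product_name VARCHAR(200)
    )
    DISTSTYLE ALL;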

My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).

Host answered 12/5, 2018 at 6:44 Comment(9)
Both SQL Server and Oracle support columnar table storage, and it's the default for MPP SQL Server (Azure SQL Data Warehouse). - Oporto
@DavidBrowne-Microsoft I'm not sure what your point is; regardless of the addition of columnar features, SQL Server and Oracle are both still used primarily as rowstore databases. The point of the answer is that a) there isn't a generic answer to this kind of question and b) the answer for an MPP system like Redshift is different from that for a traditional database like SQL Server or Oracle. - Host
My point is just that your point about the costs of wide tables in columnar format is applicable to any modern DBMS. - Oporto
Hi David, I asked a question here: are you saying we should use big wide tables or dim/fact for Azure Data Warehouse? This question is more for Redshift. Thanks #54002193 - Imprecise
By the way, it seems like Azure Data Warehouse prefers Dims and Facts, learn.microsoft.com/en-us/azure/sql-data-warehouse/… , I would like to confirm. - Imprecise
@NathanGriffiths this is an extremely helpful answer - thank you. One follow-up question: if I use a single table as you recommend, do you also recommend making each dimension column part of an interleaved sort key? - Risky
@Risky interleaved sort keys are useful if you have somewhat unpredictable query patterns where one or more of those "dimension" columns will be used in the query WHERE clause as a filter, i.e. columnA = "SomeValue". If you have a predictable set of columns that will be filtered on in a particular order, a compound sort key may be a better choice (see the sketch after these comments). Don't create an interleaved sort key on columns that contain incremental key values (identity values) or timestamps. See docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html - Host
@NathanGriffiths OK, once again this is very helpful - thank you! - Risky
I also highly recommend this presentation from AWS re:Invent for learning more about zone maps, sort keys and materializing columns: youtu.be/TJDtQom7SAA - Host
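To illustrate the two sort-key styles discussed in the comments above (only a sketch, with invented table and column names):

    -- Compound sort key: effective when filters follow a predictable
    -- column order (country first, then order_date).
    CREATE TABLE sales_compound (
        country     VARCHAR(100),
        order_date  DATE,
        order_total DECIMAL(12,2)
    )
    COMPOUND SORTKEY (country, order_date);

    -- Interleaved sort key: weights each column equally, which suits
    -- unpredictable filters; avoid identity or timestamp columns here.
    CREATE TABLE sales_interleaved (
        country VARCHAR(100),
        region  VARCHAR(100),
        channel VARCHAR(50)
    )
    INTERLEAVED SORTKEY (country, region, channel);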