Time and date dimension in data warehouse
Asked Answered
S

4

19

I'm building a data warehouse. Each fact has it's timestamp. I need to create reports by day, month, quarter but by hours too. Looking at the examples I see that dates tend to be saved in dimension tables. alt starexample
(source: etl-tools.info)

But I think, that it makes no sense for time. The dimension table would grow and grow. On the other hand JOIN with date dimension table is more efficient than using date/time functions in SQL.

What are your opinions/solutions ?

(I'm using Infobright)

Subjectivism answered 24/3, 2010 at 11:44 Comment(1)
Hourly reports seems kind of a high resolution for a data warehouse. Is really needed/appropriate?Ram
P
11

My guess is that it depends on your reporting requirement. If you need need something like

WHERE "Hour" = 10

meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than

WHERE date_part('hour', TimeStamp) = 10  

because the date_part() function will be evaluated for every row. You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:

WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15' 

which gets awkward when using dimension fields.

Usually, time dimension has a minute resolution, so 1440 rows.

Pandanus answered 25/3, 2010 at 15:53 Comment(3)
To be clear, you're recommending two separate dimensions, one of days (365*10 = 3,650 records) and one of minutes (1,440 records)? I'd like to understand the advantage of splitting it up; a single DateTime dimension would be larger (365*10*24 = 87,600 records at an hourly grain) but still not huge, and would make time zone calculations much easier.Margetmargette
@JonofAllTrades by splitting it up each dimension has a sensible PK. Some facts are going to be at the date granularity (i.e. without a timestamp), and some are going to be at the time granularity. Joining a fact table at the date granularity to the dimension at time granularity is going to cause duplicates that you then need to throw more resource at to remove.Grane
@jackohug: Sure, which is why I always have a Dates table and a Times table. But when you do have date-time values, why use two keys and double join instead of a single four-byte FK to a DateTimes table? It's worked fine for me, but some people seem allergic to it, for no well-explained reason.Margetmargette
A
34

Kimball recommends having separate time- and date dimensions:

In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-ofday dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.

Albite answered 24/3, 2010 at 20:41 Comment(1)
@Albite is it necessary to keep a datetime column in the fact table as another answer states? Since when filter time window like between '2010-03-22 23:30' and '2010-03-23 11:15' is really not good to operate with two join tables.Erstwhile
P
11

My guess is that it depends on your reporting requirement. If you need need something like

WHERE "Hour" = 10

meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than

WHERE date_part('hour', TimeStamp) = 10  

because the date_part() function will be evaluated for every row. You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:

WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15' 

which gets awkward when using dimension fields.

Usually, time dimension has a minute resolution, so 1440 rows.

Pandanus answered 25/3, 2010 at 15:53 Comment(3)
To be clear, you're recommending two separate dimensions, one of days (365*10 = 3,650 records) and one of minutes (1,440 records)? I'd like to understand the advantage of splitting it up; a single DateTime dimension would be larger (365*10*24 = 87,600 records at an hourly grain) but still not huge, and would make time zone calculations much easier.Margetmargette
@JonofAllTrades by splitting it up each dimension has a sensible PK. Some facts are going to be at the date granularity (i.e. without a timestamp), and some are going to be at the time granularity. Joining a fact table at the date granularity to the dimension at time granularity is going to cause duplicates that you then need to throw more resource at to remove.Grane
@jackohug: Sure, which is why I always have a Dates table and a Times table. But when you do have date-time values, why use two keys and double join instead of a single four-byte FK to a DateTimes table? It's worked fine for me, but some people seem allergic to it, for no well-explained reason.Margetmargette
C
5

Time should be a dimension on data warehouses, since you will frequently want to aggregate about it. You could use the snowflake-Schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you if this is good design.

Crashaw answered 24/3, 2010 at 12:0 Comment(5)
If date is a dimension for 10 years it has only about 3650 records. Hour-by-hour reports are very useful here - we need to compare days: monday to monday, tuesday to tuesday and hours monday 11:00-12:00 to tuesday 11:00-12:00. Do you think that snowflake is more useful/efficent than star ?Subjectivism
Snowflake can help reduce the redundancy in the dimension-tables, but if that helps you performance- or memory-wise in your particular case I cannot say.Ram
A date dimension with 10 years and hours is still small: 87,660 rows. Further, you can summarize old data to reduce the time resolution. After 10 years, how much relevance can 10AM on Thursday actually have?Bridesmaid
depends entirely on the field, for example if i was a large Supermarket chain i would like to know peak hours history. if however i was a company making 10 to twenty sales a day i may not be all that interested in daily activity let alone hourlyWaft
How do you propose using snowflaking here? With a primary DateTime dimension, which flakes out to a Date dimension? These seems more cumbersome than having separate Date and Time dimensions, but I'm interested in the advantages you see.Margetmargette
C
4

I would recommend having seperate dimension for date and time. Date Dimension would have 1 record for each date as part of identified valid range of dates. For example: 01/01/1980 to 12/31/2025.

And a seperate dimension for time having 86400 records with each second having a record identified by the time key.

In the fact records, where u need date and time both, add both keys having references to these conformed dimensions.

Caducous answered 21/9, 2011 at 20:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.