Why is datekey in fact tables always INT?
Asked Answered
W

5

6

I'm looking at the datekey column from the fact tables in AdventureWorksDW and they're all of type int.

Is there a reason for this and not of type date?

I understand that creating a clustered index composed of an INT would optimize query speed. But let's say I want to get data from this past week. I can subtract 6 from date 20170704 and I'll get 20170698 which is not a valid date. So I have to cast everything to date, subtract, and then cast as int.

Right now I have a foreign key constraint to make sure that something besides 'YYYYMMDD' isn't inserted. It wouldn't be necessary with a Date type. Just now, I wanted to get some data between 6/28 and 7/4. I can't just subtract six from `20170703'; I have to cast from int to date.

It seems like a lot of hassle and not many benefits.

Thanks.

Warmth answered 5/7, 2017 at 19:4 Comment(3)
Why do you believe an index on int (4 bytes) would be more efficient than one on date datatype (3 bytes)?Dropout
Some discussion here made2mentor.com/2011/05/…Dropout
Using a surrogate key has one important benefit: if you decide to increase the granularity of your time dimension (from, say, days to hours), you can do so rather painlessly, without changing existing data (and yes, I have experience with a data warehouse that did this). Note that I am talking about a surrogate key, not an INT key that creatively encodes a date, which I can't see having any benefit over an actual date/time type.Tannertannery
B
2

Yes, you could be using a Date data type and have that as your primary key in the Fact and the dimension and you're going to save yourself a byte in the process.

And then you're going to have to deal with a sale that is recorded and we didn't know the date. What then? In a "normal" dimensional model, you define Unknown surrogate values so that people know there is data and it might be useful but it's incomplete. A common convention is to make it zero or in the negative realm. Easy to do with integers.

Dates are a little weird in that we typically use smart keys - yyyymmdd. From a debugging perspective, it's easy to quickly identify what the date is without having to look up against your dimension.

You can't make an invalid date. Soooo what then? Everyone "knows" that 1899-12-31 is the "fake" date (or whatever tickles your fancy) and that's all well and good until someone fat fingers a date and magically hit your sentinel date and now you've got valid unknowns mixed with merely bad data entry.

If you're doing date calculations against an smart key, you're doing it wrong. You need to go to your data dimension to properly resolve the value and use methods that are aware of date logic because it's ugly and nasty beyond just simple things like month lengths and leap year calculations.

Blemish answered 5/7, 2017 at 21:8 Comment(2)
I don't really buy this as a good reason. Fat fingering 1899-12-31 would be quite difficult to do when entering typical business dates. And I don't think use of datatypes should be predicated on the possibility of data entry errors. It would be easy to fat finger 2017-07-01 instead of 2017-06-01 why is fat fingering an unknown date any more of a problem? If you really wanted to avoid fat finger errors then maybe you should use a Guid instead of a smart key (or at least keys a certain levenshtein distance apart) Of course nobody does that.Dropout
In our case, there's no data entry. We import tons of data from an Oracle table, and the dates are guaranteed to be correct since the column in the Oracle table is also of type DATE(7)Warmth
D
1

Actually that fact table has a relationship to a table DimDate, and if you join that table you would get many more options for point in time search, then if you would`ve got by adding and removing days/months.

Say you need list of all orders on second Saturday of May? Or all orders on last week of december? Also some business regulate their fiscal year different. Some start in June, some start in January..

In summary, DimDate is there to provide you with flexibility when you need to do complicated date searches without doing any calculations, and using a simple index seek on DimDate

Dogear answered 5/7, 2017 at 19:15 Comment(8)
But the primary key of the date dimension could be a date rather than an integer. The question isnt asking what the use of the date dimension is but why integers such as 20170701 are used as keys rather than the datatype designed for dates.Dropout
Could be, it is just a better practice to use auto generated ever increasing integer rather than a date. No particular reasonDogear
So it is "better practice" but there is "no particular reason"? This is hardly a convincing argument. Why is it better practice?Dropout
I just told you, having an auto generated ever increasing integer gives you advantage of 1. Not manually specifying the value every time you make an insert and 2. You dont have to worry about whats my next value.. 3. You wont always have stored 'date' value and when you do searches using '2002-01-01' format, or any quoted format, query optimizer will have to implicitly convert it into date.Dogear
The example in the question isn't auto generated. It is using an explicitly defined format that encodes a date and will have gaps. E.g from 20170630 to 20170701. Inserts into date dimensions are not common anyway.Dropout
They could`ve created sequence and deleted it, my point is it is easier to specify less than 20100101 then make date conversions from a string. Also you can create a second column being a date and use the format depending on a region, and having 31-12-2017 as a primary key is a pretty bad idea, and also having 2 date columns in different format just to make a use of a primary key makes no senseDogear
Date columns don't have any format. It is stored as a 3 byte integer without any format. As far as the string format for literals just use ISO format. '2017-06-30'. Literals are pretty uncommon compared to parameters and choosing a larger datatype on the grounds you have to type a few less characters in literals doesn't seem a good reason. And the overhead of the query processor casting that to date during parsing is minuscule.Dropout
@MartinSmith So do you think it would be good /okay design to have date attribute with date datatype vs int datatype? All the projects I have worked in so far use INT for date attributes in FACT and Date dimension( int primary key and another attribute with DATE type to be used for calculations in where clause etc), but there was discussion to change INT to Date for reporting purposes. And we were thinking that this might cause performance issues as we thought INT is more efficient but theoretically it appears not to be valid.Sanbo
W
1

It's a good question, but the answer depends on what kind of datawarehouse you're aiming for. SSAS, for instance, covers tabular and multi-dimensional.

In multi-dimensional, you would never be querying the fact table itself through SQL, so the problem you note with e.g. subtracting 6 days from 20170704 would actually never arise. Because in MD SSAS you'd use MDX on the dimension itself to implement date logic (as suggested in @S4V1N's answer above). Calendar.Date.PrevMember(6). And for more complicated stuff, you can build all kinds of date hierarchies and get into MDX ParallelPeriod and FirstChild and that kind of thing.

For a datawarehouse that you're intending to use with SQL, your question has more urgency. I think that in that case @S4V1N's answer still applies: restrict your date logic to the dimension side

  1. because that's where it's already implemented (possibly with pre-built calendar and fiscal hierarchies).
  2. Because your logic will operate on an order of magnitude less rows.

I'm perfectly happy to have fact tables keyed on an INT-style date: but that's because I use MD SSAS. It could be that AdventureWorksDW was originally built with MD SSAS in mind (where whether the key used in fact tables is amenable to SQL is irrelevant), even though MS's emphasis seems to have switched to Tabular SSAS recently. Or the use of INTs for date keys could have been a "developer-nudging" design decision, meant to discourage date operations on the fact tables themselves, as opposed to on the Date dimension.

Wooton answered 6/7, 2017 at 16:21 Comment(1)
Thanks for your reply. Our database started out as a relational database, and I'm testing out SSDT due to the reports that are being requested. But I will definitely be querying with regular tsql.Warmth
K
0

The thread is pretty old, but my two cents.

At one of the clients I worked at, the design chosen was an int column. The reason given (by someone before I joined) was that there were imports from different sources - some that included time information and some that only provided the date information (both strings, to begin with).

By having an int key, we could then retain the date/datetime information in a datetime column in the Fact table, while at the same time, have a second column with just the date portion (Data type: date/datetime) and use this to join to Dim table. This way the (a) aggregations/measures would be less involved (b) we wouldn't prematurely discard time information, which may be of value at some point and (c) at that point, if required the Date dimension could be refactored to include time OR a new DateTime dimension could be created.

That said, this was the accepted trade-off there, but might not be a universal recommendation.

Kero answered 9/2, 2018 at 8:18 Comment(0)
L
0

Now a very old thread,

For non-date columns a sequential integer key is considered best practice, because it is fast, and reasonably small. A natural key which encapsulates business logic could change overtime and also may need some method of identifying which version of that dimension it is for a slowly changing dimension.

[https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/dimension-surrogate-key/][1]

Ideally for consistency a date dimension should also have a sequential integer key, so why is it different? After all the theory of debugging could be also applied to other (non-date) dimensions. From The Data Warehouse Toolkit, 3rd Edition, Kimball & Ross, page 49 (Calendar Date Dimension) is this comment

To facilitate partitioning, the primary key of a date dimension can be more meaningful, such as an integer representing YYYYMMDD, instead of a sequentially-assigned surrogate key.

Although I think this means partitioning of a fact table. I argue that the datekey is an integer to allow for consistency with other dimensions but not a sequential key to allow for easier table partitioning.

Loralorain answered 3/1, 2020 at 22:57 Comment(1)
Still does explain why you could not use a natural date datatype '2020-01-15'. It orders just the same as the increasing date as an integer and you certainly can partition using a normal dateZacynthus

© 2022 - 2025 — McMap. All rights reserved.