What is the reason that in measure fields in fact tables (dimensionally modeled data warehouses) NULL values are usually mapped as 0?
Although you've already accepted another answer, I would say that using NULL is actually a better choice, for a couple of reasons.
The first reason is that aggregates return the 'correct' answer (i.e. the one that users tend to expect) when NULL is present but give the 'wrong' answer when you use zero. Consider the results from AVG() in these two queries:
-- with zero; gives 1.5
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select 0
) dt
-- with null; gives 2
select SUM(measure), AVG(measure)
from
(
select 1.0 as 'measure'
union all
select 2.0
union all
select 3.0
union all
select null
) dt
If we assume that the measure here is "number of days to manufacture item" and NULL represents an item that is still being produced then zero gives the wrong answer. The same reasoning applies to MIN() and MAX() too.
The second issue is that if zero is a default value, then how do you distinguish between zero as a default and zero as a real value? For example, consider a measure of "shipping charges in EUR" where NULL means that the customer picked up the order himself so there were no shipping charges and zero means the order was shipped to the customer for free. You can't use zero to replace NULL without completely changing the meaning of the data. You can obviously argue that the distinction should be clear from other dimensions (e.g. shipping method) but that adds more complexity to reports and understanding the data.
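To make that distinction concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the warehouse database. The fact_orders table, its columns, and the sample rows are all hypothetical. It suggests that, with NULL kept for picked-up orders, free deliveries can be counted from the measure alone, and that this is no longer possible once NULL is replaced by zero.

```python
import sqlite3

# Hypothetical fact table: shipping_eur is NULL when the customer picked the
# order up (no shipping took place) and 0 when it was shipped for free.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, shipping_eur REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?)",
    [(1, 4.5), (2, 0.0), (3, None), (4, None), (5, 7.5)],
)

# With NULL preserved, the two cases can be told apart from the measure alone.
free_shipped, picked_up, avg_charge = conn.execute(
    """
    SELECT SUM(CASE WHEN shipping_eur = 0 THEN 1 ELSE 0 END),
           SUM(CASE WHEN shipping_eur IS NULL THEN 1 ELSE 0 END),
           AVG(shipping_eur)
    FROM fact_orders
    """
).fetchone()

# After mapping NULL to 0, pick-ups are indistinguishable from free shipping.
free_after_mapping, avg_after_mapping = conn.execute(
    """
    SELECT SUM(CASE WHEN COALESCE(shipping_eur, 0) = 0 THEN 1 ELSE 0 END),
           AVG(COALESCE(shipping_eur, 0))
    FROM fact_orders
    """
).fetchone()

print(free_shipped, picked_up, avg_charge)
print(free_after_mapping, avg_after_mapping)
```

With these sample rows the first query reports one free delivery and two pick-ups, and averages the charge over the three shipped orders only. The second query reports three "free" orders and a lower average, because the pick-ups are now counted as zero-cost shipments.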
It depends upon what you're modeling, but in general it's to avoid complications with performing aggregates. And in many scenarios it makes sense to treat NULL as 0 for those purposes.
For example, a customer with NULL orders for a given period of time. Or a sales person with NULL sales revenue (shame on him!).
COUNT handles NULL differently though so it still makes sense. You could explicitly count the number of NULL values in a relation. You can't really add up (i.e. SUM) the values 5 + 3 + 20 + NULL + 8. – Polluted
The main reason is that the database treats nulls differently from blanks or zeros, even though they look like blanks or zeros to the human eye.
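How a database treats NULL differently from zero is easy to check. The sketch below uses Python's built-in sqlite3 module as an assumed stand-in engine, and the table t and column v are hypothetical. In SQLite, and as far as I know in standard SQL generally, a scalar expression containing NULL evaluates to NULL, whereas the SUM aggregate skips NULL rows. COUNT(*) counts rows, while COUNT(v) counts only non-NULL values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")  # hypothetical one-column table
conn.executemany("INSERT INTO t VALUES (?)", [(5,), (3,), (20,), (None,), (8,)])

# Scalar arithmetic: any NULL operand makes the whole expression NULL.
scalar_sum = conn.execute("SELECT 5 + 3 + 20 + NULL + 8").fetchone()[0]

# Aggregates: SUM skips NULL rows; COUNT(*) counts rows, COUNT(v) non-NULL values.
agg_sum, row_count, value_count, null_count = conn.execute(
    "SELECT SUM(v), COUNT(*), COUNT(v), COUNT(*) - COUNT(v) FROM t"
).fetchone()

# Mapping NULL to 0 at query time leaves the stored data unchanged.
sum_with_default = conn.execute("SELECT SUM(COALESCE(v, 0)) FROM t").fetchone()[0]

print(scalar_sum, agg_sum, row_count, value_count, null_count, sum_with_default)
```

If a report needs zeros, applying COALESCE in the query, as in the last statement, is one way to get them without losing the NULLs in the fact table itself.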
Here is a link to an old design tip by Ralph Kimball on the same topic.
This blogpost talks about avoiding nulls in measures and gives a couple of suggestions.
NULL instead of 0 should be used if you intend to take an average over your fact column. This is the only time I believe NULLs are OK in DWH facts or dimensions.
If a fact value is unknown or late arriving, then leaving it as NULL is best.
Aggregate functions such as MIN and MAX work on NULLs by simply ignoring them.
(For the record, one of Ralph Kimball's sidekicks said this in a course of his that I attended.)
with goodf as
(
select 1 x
union all
select null
union all
select 4
)
select sum(x) sumx,min(x) minx,max(x) maxx,avg(cast(x as float)) avgx
from goodf
with badf as
(
select 1 x
union all
select 0 /* unknown */
union all
select 4
)
select sum(x) sumx,min(x) minx,max(x) maxx,avg(cast(x as float)) avgx
from badf
In badf above, the average comes out wrong because the 0 standing in for the unknown value is treated as a literal 0. MIN is affected in the same way: it returns 0 instead of 1.