What to do with null values when modeling and normalizing?

Asked 22/11, 2016 at 3:51 Answered 22/11, 2016 at 7:1

postgresql null relational-database database-normalization

I have to create a database for a venue.

A client books a room for an event. The problem is that the clients don't always provide their name, their email, and their phone number. Most of the time it's either name and email or name and phone. It's rarely all 3 but it happens.

I need to store each of these in their respective attribute (name, email, phone). But the way they give me their info, I have a lot of null values. What can I do with these nulls? I've been told that it's better to not have nulls. I also need to normalize my table after that.

Demisemiquaver answered 22/11, 2016 at 3:51 Comment(1)

There is nothing in of itself wrong with a NULL value in your table AFAIK. The bigger question is what do you want to do with those NULL values? Would you prefer the database to replace them with a default value? Or, maybe you want to handle the NULL values in a special way when delivering the data to a UI or client? – Felipafelipe 22/11, 2016 at 3:54

SQL treats NULL specially per its version of 3VL (3-valued logic). Normalization & other relational theory does not. However, we can translate SQL designs into relational designs and back. (Assume no duplicate rows here.)

Normalization happens to relations and is defined in terms of operators that don't treat NULL specially. The term "normalization" has two most common distinct meanings: putting a table into "1NF" and into "higher NFs (normal forms)". NULL doesn't affect "normalization to 1NF". "Normalization to higher NFs" replaces a table by smaller tables that natural join back to it. For purposes of normalization you could treat NULL like a value that is allowed in the domain of a nullable column in addition to the values of its SQL type. If our SQL tables have no NULLs then we can interpret them as relations & SQL join etc as join, etc. But if you decompose where a nullable column was shared between components then realize that to reconstruct the original in SQL you have to SQL join on same-named columns being equal or both NULL. And you won't want such CKs (candidate keys) in an SQL database. Eg you can't declare it as an SQL PK (primary key) because that means UNIQUE NOT NULL. Eg a UNIQUE constraint involving a nullable column allows multiple rows that have a NULL in that column, even if the rows have the same values in every column. Eg NULLs in SQL FKs cause them to be satisfied (in various ways per MATCH mode), not to fail from not appearing in the referenced table. (But DBMSs idiosyncratically differ from standard SQL.)

Unfortunately decomposition might lead to a table with all CKs containing NULL, so that we have nothing to declare as SQL PK or UNIQUE NOT NULL. The only sure solution is to convert to a NULL-free design. After then normalizing we might want to reintroduce some nullability in the components.

In practice, we manage to design tables so that there is always a set of NULL-free columns that we can declare as CK, via SQL PK or UNIQUE NOT NULL. Then we can get rid of a nullable column by dropping it from the table and adding a table with that column and the columns of some NULL-free CK: If the column is non-NULL for a row in the old design then a row with its CK subrow and column value go in the added table; otherwise it is NULL in the old design and no corresponding row is in the added table. (The original table is a natural left join of the new ones.) Of course, we also have to modify queries from the old design to the new design.

We can always avoid NULLs via a design that adds a boolean column for each old nullable column and has the old column NOT NULL. The new column says for a row whether the old column was NULL in the old design and when true has the old column be some one value that we pick for that purpose for that type throughout the database. Of course, we also have to modify queries from the old design to the new design.

Whether you want to avoid NULL is a separate question. Your database might in some way be "better" or "worse" for your application with either design. The idea behind avoiding NULL is that it complicates the meanings of queries, hence complicates querying, in a perverse way, compared to the complication of more joins from more NULL-free tables. (That perversity is typically managed by removing NULLs in query expressions as close to where they appear as possible.)

PS Many SQL terms including PK & FK differ from the relational terms. SQL PK means something more like superkey; SQL FK means something more like foreign superkey; but it doesn't even make sense to talk about a "superkey" in SQL:

Because of the resemblance of SQL tables to relations, terms that involve relations get sloppily applied to tables. But although you can borrow terms and give them SQL meanings--value, table, FD (functional dependency), superkey, CK (candidate key), PK (primary key), FK (foreign key), join, and, predicate, NF (normal form), normalize, 1NF, etc--you can't just substitute those SQL meanings for those words in RM definitions, theorems or algorithms and get something sensible or true. Moreover SQL presentations of RM notions almost never actually tell you how to soundly apply RM notions to an SQL database. They just parrot RM presentations, oblivious to whether their use of SQL meanings for terms makes things nonsensical or invalid.

Lyssa answered 22/11, 2016 at 4:10 Comment(5)

"a not-NULL UNIQUE index allows multiple rows that have a NULL in the same column" - this may be the behaviour in your SQL product of choice but I believe it is contrary to the SQL Standard... "a FK (foreign key) with a NULL in a column is always considered to be satisfied no matter what value is in that column" - again I think this is one of those 'implementation dependent' parts of the SQL Standard. I can't be bothered to check because, bottom line is: nulls and 3VL are inconsistently specified in the SQL standard and are furthermore SQL products are inconsistent with the SQL standard. – Bilicki 16/3, 2018 at 13:58

...so, while I appreciate you've made a good attempt here, this ultimately may be a fool's errand as regards a SO answer. Hugh Darwen's book 'SQL: A Comparative Survey' attempts to reconcile the RM with SQL and is littered with 'footnotes' explaining exceptions involving nulls. – Bilicki 16/3, 2018 at 14:3

@Bilicki Re "a not-NULL UNIQUE index allows multiple rows that have a NULL in the same column" the SQL standard says UNIQUE & DISTINCT treat rows that have NULLs as different. (Although SQL Server doesn't.) "a FK (foreign key) with a NULL in a column is always considered to be satisfied" is also SQL standard, although in detail it is affected by FK MATCH mode, of which usually only SIMPLE is implemented. – Lyssa 17/3, 2018 at 1:7

Happy to stand corrected! I'm definitely with "the avoid NULL and manage by removing nulls in query expressions as close to where they appear as possible" crowd :) – Bilicki 21/3, 2018 at 12:42

Paraphrasing @user2864740: In SQL Server one can use a filtered index to get the equivalent of standard SQL UNIQUE with NULLs allowed, where multiple rows can contain NULL in a given column but each subrow on specified columns without a NULL appears only once. – Lyssa 19/5, 2020 at 8:17

First of all there is nothing wrong with nulls in a database. And they are made exactly for this purpose where attributes are unknown. To avoid nulls in a database is an advice that makes little sense in my opinion.

So you'd have three (or four) values - name (first/last), email address, and phone number - identifying a client. You can have them in a table and add a constraint to it assuring that always at least one of these columns is filled, e.g. coalesce(name, email, phone) is not null. This makes sure a booking cannot be done completely anonymously.

From your explanation it is not clear whether you will always have the same information from a client. So can it happen that a client books a room giving their name and later they book another room giving their phone instead? Or will the client be looked up in the database, their name found and the two booking assigned to them? In the latter case you can have a clients table holding all information you got so far, and the booking will contain the client record ID as a reference to this data. In the former case you may not want to have a clients table, because you cannot identify whether two clients (Jane Miller and [email protected]) are really two different clients or only one client actually.

The tables I see so far:

room (room_id, ...)
venue (venue_id, ...)
client (client_id, name, email, phone)
booking (venue_id, room_id, client_id, ...)

Wells answered 22/11, 2016 at 7:1 Comment(5)

"there is nothing wrong with nulls in a database" - I see what you did there :) – Bilicki 21/3, 2018 at 12:42

It's unknown to the database. Usually we know why a value is missing or we just don't care. A client's email is missing. So it was not given to us and we cannot use it. Some rare times we have more information about that and want to use it. If you want to know whether the email has not been given to us as yet (and we should ask the client again for it) or not, then add a status column. – Wells 21/3, 2018 at 13:30

As mentioned, usually such is not necessary. A product without a price is simply a product that is not finalized; we haven't decided for a price yet. A department without a delete date is still active and has not been (logically) deleted. A product group without a default VAT just doesn't have a default and the VAT for each product must get explicitly named. We create the database; we know what the absence of values mean. – Wells 21/3, 2018 at 13:30

Erm, so you agree "they are made exactly for this purpose where attributes are unknown" is a misstatement then? – Bilicki 22/3, 2018 at 14:47

It may not be worded perfectly, but it's still correct. NULL means a value is unknown to the database. You store a client and you don't have their phone number, so you store null. – Wells 22/3, 2018 at 15:48

Recommended topics

Hot tags