Unique key with NULLs
Asked Answered
M

11

70

This question requires some hypothetical background. Let's consider an employee table that has columns name, date_of_birth, title, salary, using MySQL as the RDBMS. Since if any given person has the same name and birth date as another person, they are, by definition, the same person (barring amazing coincidences where we have two people named Abraham Lincoln born on February 12, 1809), we'll put a unique key on name and date_of_birth that means "don't store the same person twice." Now consider this data:

id name        date_of_birth title          salary
 1 John Smith  1960-10-02    President      500,000
 2 Jane Doe    1982-05-05    Accountant      80,000
 3 Jim Johnson NULL          Office Manager  40,000
 4 Tim Smith   1899-04-11    Janitor         95,000

If I now try to run the following statement, it should and will fail:

INSERT INTO employee (name, date_of_birth, title, salary)
VALUES ('Tim Smith', '1899-04-11', 'Janitor', '95,000')

If I try this one, it will succeed:

INSERT INTO employee (name, title, salary)
VALUES ('Jim Johnson', 'Office Manager', '40,000')

And now my data will look like this:

id name        date_of_birth title          salary
 1 John Smith  1960-10-02    President      500,000
 2 Jane Doe    1982-05-05    Accountant      80,000
 3 Jim Johnson NULL          Office Manager  40,000
 4 Tim Smith   1899-04-11    Janitor         95,000
 5 Jim Johnson NULL          Office Manager  40,000

This is not what I want but I can't say I entirely disagree with what happened. If we talk in terms of mathematical sets,

{'Tim Smith', '1899-04-11'} = {'Tim Smith', '1899-04-11'} <-- TRUE
{'Tim Smith', '1899-04-11'} = {'Jane Doe', '1982-05-05'} <-- FALSE
{'Tim Smith', '1899-04-11'} = {'Jim Johnson', NULL} <-- UNKNOWN
{'Jim Johnson', NULL} = {'Jim Johnson', NULL} <-- UNKNOWN

My guess is that MySQL says, "Since I don't know that Jim Johnson with a NULL birth date isn't already in this table, I'll add him."

My question is: How can I prevent duplicates even though date_of_birth is not always known? The best I've come up with so far is to move date_of_birth to a different table. The problem with that, however, is that I might end up with, say, two cashiers with the same name, title and salary, different birth dates and no way to store them both without having duplicates.

Merna answered 2/11, 2010 at 20:24 Comment(12)
Name and date of birth are NOT very unique.Howlyn
Use a sentinel birth date, e.g. '0000-00-00'.Typhoon
@Paul Tomblin: I know they're not. Can't you see that's not what the question is about, though?Merna
@smilingthax: To me that just seems like a "fake NULL." I guess I could do that but it doesn't feel right.Merna
The question is a consequence of your bad design. If you use a better design, you won't care if NULL equals or doesn't equal NULL.Howlyn
Well, what's the difference between a 'M','F','U' (unknown) enum, but NOT NULL, and one with only 'M','F', with NULL allowed? (-> Not all DBs handle NULL the same.)Typhoon
@smilingthax: You're totally right. Bad example on my part.Merna
@Paul Tomblin: Okay, so the answer is not to work with the NULLs but to use a better design. That sounds good to me. What would be a better design?Merna
In relational algebra, there is no NULL value. Every attribute should be filled, no matter what. Therefore, NULL is not exactly a value. It's a lack of value. In my opinion, it's absolutely normal that a RDBMS do not considering NULL equal to NULL. The best way to do this would be to give a unique, auto-incremented ID to every lines in your table, then either allow NULL as a value for the birthdate or create an extra table.Midwest
What if NULL means "not applicable" rather than "unknown"? This is the problem I am trying to solve at the moment. I have a unique constraint on three fields: user_id, team_scope, team_id. Here, team_scope is an enumerated field whose values represent "specific team", "all teams of which I am captain" and "all teams in my club"; only in the first case will team_id be populated. But when team_scope is one of the other values, the uniqueness constraint should still work.Energetics
(An alternative is to do away with team_scope and define magic values of team_id, though this isn't ideal from a theoretical point of view.)Energetics
Okay, I'm breaking the rules now because this is not really related to the problem you're having, but I just had the amazing conincidence of reading your article about private methods, and then, hours later, happening upon you again in a completely unrelated curiosity search about MySQL unique keys and NULLs. What are the odds!!!Finley
N
41

A fundamental property of a unique key is that it must be unique. Making part of that key Nullable destroys this property.

There are two possible solutions to your problem:

  • One way, the wrong way, would be to use some magic date to represent unknown. This just gets you past the DBMS "problem" but does not solve the problem in a logical sense. Expect problems with two "John Smith" entries having unknown dates of birth. Are these guys one and the same or are they unique individuals? If you know they are different then you are back to the same old problem - your Unique Key just isn't unique. Don't even think about assigning a whole range of magic dates to represent "unknown" - this is truly the road to hell.

  • A better way is to create an EmployeeId attribute as a surrogate key. This is just an arbitrary identifier that you assign to individuals that you know are unique. This identifier is often just an integer value. Then create an Employee table to relate the EmployeeId (unique, non-nullable key) to what you believe are the dependant attributers, in this case Name and Date of Birth (any of which may be nullable). Use the EmployeeId surrogate key everywhere that you previously used the Name/Date-of-Birth. This adds a new table to your system but solves the problem of unknown values in a robust manner.

Nexus answered 3/11, 2010 at 15:31 Comment(6)
"This just gets you past the DBMS 'problem' but does not solve the problem in a logical sense." EXACTLY! Thank you for saying this. I want to solve the design problem, not the physical implementation problem.Merna
"Don't even think about assigning a whole range of magic dates to represent 'unknown' - this is truly the road to hell." Thank you for saying this as well. It was frustrating to see this bad idea get so many upvotes.Merna
TL; DR; Normalization. This is the whole problem in most cases. If you are asking this question it's a smell that you need normalization.Worst
Not always null convention is used as unknown state as in OP question. null is often used as no optional value provided, it's defined state. DBMS shall allow user to decide how null shall be treated in context of UNIQUE key. Imagine weather table with columns (country, temp, location). location is optional. Your requirement might be to gather temperature from country or more specific location in that country, but you don't want more than one entry for country alone or country, location pair. With MySQL you can't do that easily with just one table which is disappointing.Karisa
something can be nullable, yet only one row with null value in the database, it's not mutually exclusiveBlubbery
Actually, the problem of deciding, whether a record is unique is unsolvable if the set of known data collides and there is at least one unknown data point. Adding additional (there already is an "id" column) IDs and making them part of the key does not change that - it just moves the decision to whoever assigns that ID.Fordham
A
9

I think MySQL does it right here. Some other databases (for example Microsoft SQL Server) treat NULL as a value that can only be inserted once into a UNIQUE column, but personally I find this to be strange and unexpected behaviour.

However since this is what you want, you can use some "magic" value instead of NULL, such as a date a long time in the past

Aircrew answered 2/11, 2010 at 20:29 Comment(5)
I don't necessarily disagree that MySQL handles this correctly. The end result is not what I want, though: I end up with duplicates, which is not acceptable. And to me, a "magic" value is just a "fake NULL." No offense but I find it a little hard to stomach that that's the right way to do it.Merna
Also, it's not the NULL I care about having in there twice. It's the "Jim Johnson".Merna
PostgreSQL does the same thing as MySQL (nulls are non-unique), and also claims that this behavior agrees with the SQL standard: https://mcmap.net/q/56486/-null-value-isn-39-t-uniquePorty
There isn't a case if one way or the other way is right. Depending on circumstances you need one or the other. It should be configurable. A similar design mistake was made in Java where all reference types can be null when they needed two schemes. People keep playing these tugowars and it gets ridiculous. You end up with batteries with only one pole coming out of the factory and some very red faces at the end of the production line.Zalucki
It feels to me like magic values are theoretically unsound, especially if the field in question is a foreign key. Indeed, foreign key constraints typically don't allow this. Furthermore, it feels to me like it's OK to have the applicability of a nullable field driven by another field in the table. See my comment on the question.Energetics
I
8

I recommend to create additional table column checksum which will contain md5 hash of name and date_of_birth. Drop unique key (name, date_of_birth) because it doesn't solve the problem. Create one unique key on checksum.

ALTER TABLE employee 
    ADD COLUMN checksum CHAR(32) NOT NULL;

UPDATE employee 
SET checksum = MD5(CONCAT(name, IFNULL(date_of_birth, '')));

ALTER TABLE employee 
    ADD UNIQUE (checksum);

This solution creates small technical overhead, cause for every inserted pairs you need to generate hash (same thing for every search query). For further improvements you can add trigger that will generate hash for you in every insert:

CREATE TRIGGER before_insert_employee 
BEFORE INSERT ON employee
FOR EACH ROW
    IF new.checksum IS NULL THEN
      SET new.checksum = MD5(CONCAT(new.name, IFNULL(new.date_of_birth, '')));
    END IF;
Idyllist answered 28/2, 2018 at 8:23 Comment(2)
I like this solution. But instead of using a trigger, might it make more sense to store the checksum as a generated column? I have in mind something like CHAR(32) AS (MD5(CONCAT(name, IFNULL(date_of_birth, '')))) STORED in the CREATE TABLE or ALTER TABLE command.Cahra
A generated column worked well for me in a situation where I needed this type of checksum.Grisgris
G
5

Your problem of not having duplicates based on name is not solvable because you do not have a natural key. Putting a fake date in for people whose date of birth is unknown will not solve your problem. John Smith born 1900/01/01 is still going to be a differnt person than John Smithh born 1960/03/09.

I work with name data from large and small organizations every day and I can assure you they have two different people with the same name all the time. Sometimes with the same job title. Birthdate is no guarantee of uniqueness either, plenty of John Smiths born on the same date. Heck when we work with physicians office data we have often have two doctors with the same name, address and phone number (father and son combinations)

Your best bet is to have an employee ID if you are inserting employee data to identify each employee uniquely. Then check for the uniquename in the user interface and if there are one or more matches, ask the user if he meant them and if he says no, insert the record. Then build a deupping process to fix problems if someone gets assigned two ids by accident.

Gudrunguelderrose answered 3/11, 2010 at 14:50 Comment(0)
A
4

There is a another way to do it. Adding a column(non-nullable) to represent the String value of date_of_birth column. The new column value would be ""(empty string) if date_of_birth is null.

We name the column as date_of_birth_str and create a unique constraint employee(name, date_of_birth_str). So when two recoreds come with the same name and null date_of_birth value, the unique constraint still works.

But the efforts of maintenance for the two same-meaning columns, and, the performance harm of new column, should be considered carefully.

Aristocrat answered 3/11, 2010 at 1:56 Comment(3)
Interesting idea. I don't feel awesome about it because it involves storing the same exact data twice, but I agree that it would work.Merna
If you want to go there, I would rather add a column date_of_birth_is_known with integers 1 or 0... and then you still have to add an "IF" everywhere you work with it. I had to deal with this kind of design in the past, and it is terrible, either way --- with a string or with an int flag. Sometimes people update one thing, sometimes the both - in one order, sometimes the other. Sometimes something fails outside of a transaction... Then you write scripts to validate consistency... terrible waste of time.Porty
Add a db trigger to keep the fields in sync.Immanent
B
2

You can add a generated column where the NULL value is replaced by an unused constant, e.g. zero. Then you can apply the unique constraint to this column:

CREATE TABLE employee ( 
  name VARCHAR(50) NOT NULL, 
  date_of_birth DATE, 
  uq_date_of_birth DATE AS (IFNULL(date_of_birth, '0000-00-00')) UNIQUE
);
Bimetallism answered 24/12, 2019 at 0:45 Comment(0)
B
0

The perfect solution would be support for function based UK's, but that becomes more complex as mySQL would also then need to support function based indexes. This would prevent the need to use "fake" values in place of NULL, while also allowing developers the ability to decide how to treat NULL values in UK's. Unfortunately, mySQL doesn't currently support such functionality that I am aware of, so we're left with workarounds.

CREATE TABLE employee( 
 name CHAR(50) NOT NULL, 
 date_of_birth DATE, 
 title CHAR(50), 
 UNIQUE KEY idx_name_dob (name, IFNULL(date_of_birth,'0000-00-00 00:00:00'))
);

(Note the use of the IFNULL() function in the unique key definition)

Banderilla answered 21/10, 2011 at 20:25 Comment(1)
This yields ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'date_of_birth,'0000-00-00 00:00:00')) )' at line 1 in MySQL 5.5Stigma
N
0

I had a similar problem to this, but with a twist. In your case, every employee has a birthday, although it may be unknown. In that case, it makes logical sense for the system to assign two values for employees with unknown birthdays but otherwise identical information. NealB's accepted answer is very accurate.

However, the problem I encountered was one in which the data field did not necessarily have a value. For example, if you added a 'name_of_spouse' field to your table, there wouldn't necessarily be a value for each row of the table. In that case, NealB's first bullet point (the 'wrong way') actually makes sense. In this case, a string 'None' should be inserted in the column name_of_spouse for each row in which there was no known spouse.

The situation where I ran into this problem was in writing a program with database to classify IP traffic. The goal was to create a graph of IP traffic on a private network. Each packet was put into a database table with a unique connection index based on its ip source and dest, port source and dest, transport protocol, and application protocol. However, many packets simply don't have an application protocol. For example, all TCP packets without an application protocol should be classed together, and should occupy one unique entry in the connections index. This is because I want those packets to form a single edge of my graph. In this situation, I took my own advice from above, and stored a string 'None' in the application protocol field to ensure that these packets formed a unique group.

Nitwit answered 8/9, 2017 at 13:56 Comment(0)
R
0

I were looking for one solution and the Alexander Yancharuk suggested was good idea for me. But in my case my columns are foreign keys and employee_id can be null.

I have this structure:


+----+---------+-------------+
| id | room_id | employee_id |
+----+---------+-------------+
|  1 |       1 | NULL        |
|  2 |       2 | 1           |
+----+---------+-------------+

And the room_id with employee_id NULL can not be duplicated

I solved adding a trigger before insert, like this:

DELIMITER $$
USE `db`$$
CREATE DEFINER=`root`@`%` TRIGGER `db`.`room_employee` BEFORE INSERT ON `room_employee` FOR EACH ROW
BEGIN
    IF EXISTS (
            SELECT room_id, employee_id
            FROM room_employee
            WHERE (NEW.room_id = room_employee.room_id AND NEW.employee_id IS NULL AND room_employee.employee_id IS NULL)
        ) THEN
        CALL `The room Can not be duplicated on room employee table`;
    END IF;
END$$
DELIMITER ;

I also added a constraint unique for room_id and employee_id

Rothstein answered 6/11, 2020 at 11:30 Comment(0)
G
0

I think the fundamental question here is what you actually mean with

INSERT INTO employee (name, title, salary) VALUES ('Jim Johnson', 'Office Manager', '40,000')

Your own definition of a person is name AND birth date, so what does this statement mean in that context? I'd say that the solution to your problem is to prohibit inserting half identities, like the one above, by adding NOT NULL on both your name and date_of_birth columns. That way, the statement will fail and force you to enter complete identities and the unique key will do its job to prevent you from entering the same person twice.

Greengrocery answered 8/2, 2022 at 1:15 Comment(0)
P
-3

In simple words,the role of Unique constraint is to make the field or column. The null destroys this property as database treats null as unknown

Inorder to avoid duplicates and allow null:

Make unique key as Primary key

Pe answered 6/5, 2015 at 9:43 Comment(2)
As far as I can tell, making a column part of a primary key actually removes its ability to accept NULL as a value.Aboveground
This is simply the wrong information. Please correct your answerFrigate

© 2022 - 2024 — McMap. All rights reserved.