Why does SQL standard allow duplicate rows?

One of the core rules for the relational model is the required uniqueness for tuples (rows):

Every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

In a SQL world, that would mean that there could never exist two rows in a table for which all the column values were equal. If there were no meaningful way to guarantee uniqueness, a surrogate key could be added to the table.

When the first SQL standard was released, it defined no such restriction, and it has been like this ever since. This seems like a root of all kinds of evil.

Is there any meaningful reason why it was decided to be that way? In a practical world, where could the absence of such a restriction prove useful? Does it outweigh the cons?

Destinee asked 10/6, 2015 at 21:23 Comment(4)
The meaningful reason is practicality. It turns out that constantly checking for and removing duplicates is an expensive operation.Clerestory
Agreed with @GordonLinoff; in practice, using strong keys with unique constraints is far better than constantly checking for and removing duplicates. Disk is cheap, so the additional storage cost of the potential duplicates does not outweigh the cost of implementing a constant check.Hadwyn
I want to add, this is not the root of all evil in relational databases. There are definitely worse design decisions.Clerestory
Addressing by (candidate) key turns out to be a red herring. A base table holds the rows that make some fill-in-the-(named-)blanks statement true; then JOIN of relations gives the rows that make the AND of their statements true, UNION the OR, MINUS the AND NOT, etc. Duplicate rows & columns (& NULL) break this straightforward interpretation of relations/queries. No-duplicates isn't impractical; it was just feared so by the SQL designers/implementers, who didn't understand its importance. (Also, primary keys turn out to be a red herring; a relation has one or more equally important candidate keys.)Goebel

You're assuming that databases are there solely for storing relational data; that's certainly not all they're used for, because practical considerations will always win.

An obvious example where there's no need for a primary key would be a "state" log of some description (weather/database/whatever). If you're never going to query a single value from this table, you may not want a primary key, so as to avoid paying for a key-index update on every insert. If you have a use case for picking up a single value from this table then, sure, this would be a bad solution, but some people just don't need that. You can always add a surrogate key afterwards if it becomes absolutely necessary.
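
Such a keyless log table might look like this (a minimal sketch with hypothetical names, in standard SQL):

create table weather_log (
    observed_at  timestamp,      -- when the reading was taken
    station_id   varchar(10),    -- which sensor reported it
    temperature  numeric(5,2)    -- the reading itself
);
-- no primary key and no unique constraint: inserts never pay for an
-- index lookup, and identical readings are simply appended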

Another example would be a write-intensive application that needs to tell another process to do something. This secondary process runs every N minutes/hours/whatever. De-duplicating N million records as a one-off is quicker than checking for uniqueness on every insert into the table (trust me).
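
A sketch of that one-off pass (hypothetical table names, ignoring concurrency; select distinct is standard SQL):

-- consume the raw, duplicate-laden table in one batch...
insert into work_queue (task_id, payload)
select distinct task_id, payload
from raw_events;

-- ...then empty it for the next window
delete from raw_events;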

What are sold as relational databases are not being used solely as relational databases. They're being used as logs, key-value stores, graph databases, etc. They may not have all the functionality of the competition, but some do, and it's often simpler to have a single table that doesn't fit your relational model than to create a whole other database and suffer the data-transfer performance penalties.

tl;dr People aren't mathematically perfect and so won't always use the mathematically perfect method of doing something. Committees are made up of people and can realise this, sometimes.

Choric answered 10/6, 2015 at 21:46 Comment(0)

The short answer is that SQL is not relational and SQL DBMSs are not relational DBMSs.

Duplicate rows are a fundamental part of the SQL model of data because the SQL language doesn't really try to implement the relational algebra. SQL uses a bag (multiset)-based algebra instead. The results of queries and other operations in relational algebra are relations that always have distinct tuples, but SQL DBMSs don't have the luxury of dealing only with relations. Given this fundamental "feature" of the SQL language, SQL database engines need to have mechanisms for processing and storing duplicate rows.
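
A small illustration of the bag semantics (a sketch; the persons table is hypothetical, and the FROM-less selects assume a dialect such as PostgreSQL or MySQL):

-- projection keeps duplicates unless you ask otherwise
select age from persons;            -- may return 24, 25, 24
select distinct age from persons;   -- returns 24, 25

-- UNION follows set semantics, UNION ALL follows bag semantics
select 1 union select 1;            -- one row
select 1 union all select 1;        -- two rows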

Why was SQL designed that way? One reason seems to be that the relational model was just too big a leap of faith to make at that time. The relational model was an idea well ahead of its time. SQL, on the other hand, was and remains very much rooted in the systems of three decades ago.

Wideman answered 11/6, 2015 at 10:14 Comment(1)
Hugh's oft-repeated stance on this is that duplicate removal was included in BS12 (which predated even the very first SQL system by some years) and it was nowhere near as costly as the duplicate pundits of the day (the American IBM lab working on relational systems in particular) wanted everyone to believe. The only plausible reason left as to why "SQL was designed this way" is pure politics.Controvert

The very first versions of the language did not have any form of constraints, including keys. So uniqueness could simply not be enforced. When support for constraints (keys in particular) was later added to the language, operational systems had already been written, and nobody wanted to break backward compatibility. So it (allowing duplicates) has been there ever since.

Many neat little topics of historical background, just like this one, can be found in Hugh Darwen's book "SQL: A Comparative Survey" (freely available from Bookboon).

(EDIT : presumably the reason why there was no support for constraints in the very first versions of the language, was that at the time, Codd's main vision was that the query language would effectively be a query (i.e. read-only) language, and that the "relational" aspect of the DBMS would be limited to having a "relational wrapper layer" over existing databases which were not relational in structure. In that perspective, there is no question of "updating" in the language itself, hence no need to define constraints, because those were defined and enforced in the "existing, non-relational database". But that approach was abandoned pretty early on.)

Controvert answered 11/6, 2015 at 17:55 Comment(9)
Erwin (& @Discosultan): Re "So uniqueness could simply not be enforced.": If the things being constrained were relations (holding sets of tuples) rather than SQL tables (holding bags of tuples, among other non-relational properties), then constraining would be moot. If lack of constraint expression is a factor, it must also be accompanied by a reason for not using relations, or the question is begged.Goebel
@Goebel "If the things being constrained were relations ... rather than SQL tables ..." Well the things being constrained weren't relations but SQL tables. So your logical implication, while correct, is irrelevant. The topic of "why not relations but bags" was covered fairly well by sqlvogel.Controvert
Erwin (& @Discosultan): But the question is "Why does SQL standard allow duplicate rows" and my point is you are begging the question, which is relevant. Granted the correct "answer" ends up being (per sqlvogel) "because it does". (The non-relational being possible because the OP's expectation/assumption of relationality is wrong.)Goebel
My answer to the question "why does the standard allow it" is "because the pre-standard implementations allowed it". My explanation about "why the pre-standard implementations allowed it" categorically does not presume anything like "the standard allows it". No begging the question.Controvert
@ErwinSmout unfortunately I found nothing about the duplicates issue in the book you referred to. I'm just writing a year project, and wanted to quote the book — but despite searching hard, every mention of duplicates in the book treated it as if it was simply supposed to be like that, without any explanation.Accra
@Hi-Angel, section 4.6, paragraph "In more complicated examples" is a confirmation of sqlvogel's answer. Admittedly rather tacit, but what it says is, "the SQL teams never wanted to include duplicate elimination everywhere (because as sqlvogel stated, not confident enough about performance consequences), when the BS12 team showed it possible".Controvert
Weird, but I still didn't find the phrase. Section 4.6 is «Projection and Existential Quantification»; the single mention of the words «complicated examples» is in 4.8, as part of a paragraph with a short note a page later that duplicate elimination as an option was part of the original SQL. The word «BS12» appears ≈6 times, and never in the context of duplicate elimination. Are you sure we're talking about the same book? Mine is «SQL : A comparative survey» by Hugh Darwen, 2nd edition, 2014, from the «bookboon» you referred to.Accra
Ah, no, «complicated examples» is in 4.6, right, my pdf viewer was lagging. I found the phrase «Although such intelligence is quite feasible within acceptable limits (and was used in Business System 12, for example), the inclusion of DISTINCT allows SQL implementations to place the responsibility for duplicate elimination on the user.» That is all.Accra
Okay, I guess you have a first edition and they changed something. I didn't manage to find it, but don't worry — I have a requirement to quote only books from the last five years. Though it is indeed an interesting thing: why did they remove this from the second edition?Accra

I don't know the answer to this question, but I can share my opinion: I have always thought that a duplicate tuple (= row) in a relation (= table) has no meaning, no added information, no value. In my 30+ years of Oracle and PostgreSQL I have spent countless hours on removing duplicates. So the argument that removing duplicates is "hard" for a system is moot. Yes, it is hard, so delegate this task to the machine. That's why we use a machine. Let's not do this ourselves.

One answer argued that, with a table of names and ages, a query on just the ages might show duplicate ages, which may be handy for running statistics. My point is that the duplicate ages are just annoying; you will need to remove the duplicates to get a result that is meaningful, that you can publish. If you need to do statistics, you should not just query for the ages, but write a query for the relation between each age and the number of people with that age, which will not have duplicates:

select age, count(*)
from persons
group by age

result e.g.:

+-----+-------+
| age | count |
+-----+-------+
| 24  |   2   |
| 25  |   1   |
+-----+-------+

Think relationally: every tuple has columns that relate to each other.

If a query result has duplicate rows, the duplicate is noise, with no meaning, and needs to be removed.

I have not seen a use for duplicates.

Insectile answered 6/1, 2023 at 12:28 Comment(0)

One reason that no one talks about is that Codd was simply wrong to elide duplicate rows. He ignored the final chapters of Russell and Whitehead's Principia Mathematica, one of which was devoted to "Relation Arithmetic". At HP's "Internet Chapter 2" project, I was able to peel off a little money to hire one of Paul Allen's thinktank participants, who had been working on quantum programming languages, to bring that work into computer network programming. His name was Tom Etter, and he went back and reviewed Russell's work. He discovered a flaw in it -- a limitation that Russell himself admitted -- and figured out how to remove that limitation. Perhaps Codd did look at Relation Arithmetic but was put off by this limitation. I don't know. But what I do know is that the utility of duplicate rows is obvious:

What many people end up doing in SQL is keeping an extra column for duplicate row counts. Aside from Codd's "twelve rules" declaring that you aren't allowed access to the counts within his "relational algebra", there is the fact that a duplicate row count column doesn't belong at the same level of abstraction as the row data themselves. If you want to treat it as "data", it is what is called "metadata". There are all kinds of ways to get confused about levels of abstraction. Here's a way to get _un_confused about this particular distinction between levels:

Imagine an implementation of relational database "algebra" that permits duplicate rows. Let's try not to get confused, and imagine this simple use case:

Age, Name
24, John
25, Mary
24, Elitsa

Now we perform a project operation from the higher-dimensional relation to a lower-dimensional relation, Age:

Age
24
25
24
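
In SQL terms (a sketch, assuming the three rows above live in a persons table), that projection is simply:

select age from persons;   -- duplicates preserved: 24, 25, 24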

The latent variables of the higher dimensions are absent from this view, but their cases are still represented in the statistics of this table. In terms of data representation, including an additional column for the number of times a row occurs works just fine, so long as we avoid confusion about the levels of abstraction. Why might one want to just leave the duplicate rows in the projected table rather than counting them up? Well, that's very situation-dependent, but a clear case is in Monte-Carlo simulation sampling and/or imputation of missing values:

You just use a uniformly distributed random number generator to pick out which row you want to sample.
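
A sketch of that sampling step, assuming PostgreSQL's random() function and the persons table above; because the duplicate 24 is still present, each underlying person is equally likely to be drawn:

select age
from persons
order by random()   -- use rand() in MySQL
limit 1;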

These statistics are essential to distinguish Etter's (and Russell's) "relation numbers" from Codd's "relations". Moreover, the SQL specification violates Codd's "relation algebra" in precisely this manner.

A practical implication of adopting relations as the most general formalism in computer programming languages:

Nowadays people are spending vast amounts of time dealing with parallelizing their computations. While it is true that functional programming permits a level of abstraction in which some parallelism is inherent to the structure of the program, it is limited to what might be thought of as "and parallelism". For instance:

z^2 = x^2+y^2

If x and y are specified, this is a functional expression that permits the parallel evaluation of the subexpressions x^2 and y^2 because of their independence. But what if, instead of treating "=" as an assignment, we treat it as a relation? Now, without duplicating the code, we have a different "computer program" depending on which of the variables are specified. For instance if z is specified but not x and y, this constrains the algebraic solutions to a geometric shape. This is, in fact, how parametric CAD systems operate.
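
For example, fixing z = 5 leaves the constraint x^2 + y^2 = 25, whose solutions form a circle of radius 5: infinitely many (x, y) pairs such as (3, 4), (4, 3) and (5, 0), any of which the system is free to explore in parallel.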

To illustrate further we can return to the simple constraint:

x^2 = 1

Here, again, '=' is a constraint, but watch what now happens with the resulting relation table:

x
1
-1

This is "or parallelism". It is of a fundamentally different character than "and parallelism". It is, in fact, the sort of parallelism that differentiates between abstraction levels of application level programming and operating system level programming -- but that is far from the only way we elide "or parallelism" and cause vast suffering.

Romansh answered 26/10, 2021 at 16:9 Comment(6)
-1. To quote Codd: if something is true, then saying it twice will not make it any truer. Do you believe differently, that is, do you believe that saying it twice will make it any truer?Controvert
Consider the difference between the expressions "n>0" and "n" -- particularly in the context of Codd's "Project" operator on a relation that has no duplicate rows. While it is true that in such a relation, there is no distinction between the two expressions, once Project has hidden a variable (a column) there is a distinction between the two expressions. It is precisely that difference upon which statistics is based.Romansh
The projection operator has nothing to do with "expressions". Its parameters are a relation and (according to the traditional pov) an attribute list. Nothing like "n>0" which has no connection at all with (neither Codd's nor anyone else's for that matter) "project operator that has no duplicate rows". You are just talking gibberish.Controvert
And your thinking error is in the fact that you seem to be convinced that those "latent variables of the higher dimensions" must somehow be retained, but then "latently" or "invisibly" or some such. That's like saying there is (or should reasonably be) a difference between the number 2 and the number 2 because the former was obtained from 1+1 and the latter was obtained from 4-2. Sheer nonsense, of course.Controvert
There are over 20k hits on the exact phrase "relational algebra expression".Romansh
As for your assertion "'n>0' which has no connection at all with... project operator": "Projection is relational algebra's counterpart of existential quantification in predicate logic." en.wikipedia.org/wiki/Projection_(relational_algebra)Romansh

Although that is normally how tables work, it's not practical to have it as a rule.

To follow the rule, a table must always have a primary key. That means that you can't just remove the primary key on a table and then add a different one. You would need to make both changes at once, so that the table is never without a primary key.
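
A sketch of making both changes at once (assuming MySQL, which accepts several alterations in one ALTER TABLE statement, and a hypothetical orders table):

-- the table is never left without a primary key
alter table orders
    drop primary key,
    add primary key (order_uuid);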

Wiatt answered 10/6, 2015 at 21:41 Comment(1)
Interesting answer but why would that be a fundamental problem? Some DBMSs do effectively allow you to drop a key and create a new one in a single operation.Wideman

In a SQL world, that would mean that there could never exist two rows in a table for which all the column values are equal, and that's true: unless all the attributes of a tuple match those of another, it's not a duplicate, even if the two rows differ only in the primary key column.

That's why we should define other (unique) key column(s) along with the primary key, to identify each record as unique.
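
For example (a sketch with hypothetical names):

create table employees (
    id     int primary key,   -- surrogate key
    email  varchar(100),
    unique (email)            -- natural key that actually prevents duplicates
);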

Stilly answered 10/6, 2015 at 21:37 Comment(0)
