In a junction table, should I use a Primary key and a unique constraint, or a compound/composite key?
Asked Answered
S

6

19

I have read through handfuls of what would seem to make this a duplicate question. But reading through all of these has left me uncertain. I'm hoping to get an answer based on the absolute example below, as many questions/answers trail off into debates back and forth.

If I have:

dbo.Book
--------
BookID PK int identity(1,1)

dbo.Author
----------
AuthorID PK int identity(1,1)

Now I have two choices for a simple junction table:

dbo.BookAuthor
--------------
BookID CPK and FK
AuthorID CPK and FK

The above would be a compound/composite key on both FKs, as well as set up the FK relationships for both columns - also using Cascade on delete.

OR

dbo.BookAuthor
--------------
RecordID PK int identity(1,1)
BookID FK
AuthorID FK

Foreign key relationships on BookID and AuthorID, along with Cascade on delete. Also set up a unique constraint on BookID and AuthorID.

I'm looking for a simple answer as to why one method is better than another in the ABOVE particular example. The answers that I'm reading are very detailed, and I was just about to settle on a compound key, but then watched a video where the example used an Identity column like my first example.

It seems this topic is slightly torn in half, but my gut is telling me that I should just use a composite key.

What's more efficient for querying? It seems having a PK identity column along with setting up a unique constraint on the two columns, AND the FK relationships would be more costly, even if a little.

Schonfield answered 3/3, 2015 at 23:22 Comment(2)
A curious case brought up by an editor was a book composed of chapters submitted by various authors, possibly several per chapter. As Joel Coehoorn points out, there might be another entity to be tracked: chapter.Paraphrase
@Paraphrase Right, and excellent point. Every answer thus far is correct it seems, but in my example case, there will never be another column.Schonfield
C
8

As a staunch proponent of, and proselytizer for, the benefits of surrogate keys, I none-the-less make an exception for all-key join tables such as your first example. One of the benefits of surrogate keys is that engines are generally optimized for joining on single integer fields, as the default and most common circumstance.

Your first proposal still obtains this benefit, but also has a 50% greater fan-put on each index level, reducing both the overall size and height of the indices on the join table. Although the performance benefits of this are likely negligible for anything smaller than a massive table it is best practice and comes at no cost.

When I might opt for the other design is if the relation were to accrue additional columns. At that point it is no longer strictly a join table.

Cherie answered 4/3, 2015 at 0:45 Comment(7)
My understanding was that if you have a compound key, there order in which columns appear in the compound key matters. When you are trying to lookup (think join) on the second column in the pair it is very inefficient, because the physical order (in case of clustered PK) or index (in case of non-clustered PK) won't allow for as efficient lookup. So you will end up creating index on the second column separately anyway. This feels really unnecessary asymmetric.Crownpiece
@zespri: it's fixable with a non-clustered index on the second key. If you use SQL Server or some other database that allows you to include non-key columns in an index, you could include the first key as well, and then you'd basically have two tables, one for looking up from either direction (counting the index as basically a table, which it sort of is).Pare
Should I create an index on the columns combined, as in Columns: AuthorID (ASC), BookID (ASC), or an individual index on each? Or would it be a question of which many-to-many direction would be queried more often?Schonfield
@user1447679: You require one of the indices to enforce the all-key property of the join table (perhaps created to implement a constraint) so order that by the (expected) most frequent lookup. Later on, add a non-clustered index the other way around as performance requires it.Cherie
@PieterGeerkens Using the table designer for example, in SQL Server, selecting both columns and choosing "Set Primary Key" by default creates a clustered index on the combination of the column. So in that case, you're saying I should add one more NON-clustered index on the column that would be joined/queried against more often, and perhaps add another one later on, for the other column, if performance requires it? Just making sure I understand.Schonfield
@user1447679: Two column can only be ordered in two distinct ways - no need for three indices. Every PK constraint will create an index as its enforcement mechanism.Cherie
@PieterGeerkens - So creating a compound key with a clustered index on both columns is all that's needed. I should also order the two columns keeping the one that would be joined/queried against first. This is correct?Schonfield
D
28

This is something I've always remembered from my database course way back in college. We were covering the section from the textbook on "Entity Design" and it was talking about junction tables... we called them intersect tables or intersection relations. I was actually paying attention in class that day. The professor said, in his experience, a many-to-many junction table almost always indicates an unidentified missing entity. These entities almost always end up with data of their own.

We were given an example of Student and Course entities. For a student to take a course, you need to junction between those two. What you actually have as a result is a new entity: an Enrollment. The additional data in this case would be things like Credit Type (audit vs regular) or Final Grade.

I remember that advice to this day... but I don't always follow it. What I will do in this situation is stop, and make sure to go back to the stakeholders on the issue and work with them on what data points we might still be missing in this junction. If we really can't find anything, then I'll use the compound key. When we do find data, we think of a better name and it gets a surrogate key.

Update in 2020
I still have the textbook, and by amazing coincidence both it and this question were brought to my attention within a few hours of each other. So for the curious, it was Chapter 5, section 6, of the 7th edition of this book:

https://www.amazon.com/Database-Processing-Fundamentals-Design-Implementation-dp-9332549958/dp/9332549958/

Dinin answered 4/3, 2015 at 1:13 Comment(3)
fwiw, I believe this class would have been spring 2001, so we're talking 14 years ago for this.Dinin
Sensible point about the need to look out for other attributes. Your hypothetical Enrollment table is what many people call a "junction table", although the term rarely seems to be precisely defined. If the term junction table means anything then it probably means a table with more than one foreign key. Personally I don't find that terminology very useful. There are no junction tables, just tables.Kiele
@JoelCoehoorn I barely remember last week, let alone 2001. Good job :)Schonfield
H
8

I prefer the first design, using Composite Keys. Having an identity column on the junction table does not give you an advantage even if the parent tables have them. You won't be querying the BookAuthor using the identity column, instead you would query it using the BookID and AuthorID.

Also, adding an identity would allow for duplicate BookID-AuthorID combination, unless you put a constraint.

Additionally, if your primary key is (BookID, AuthorID), you need to an index on AuthorID, BookID). This will help if you want to query the the books written by an author.

Highstepper answered 4/3, 2015 at 0:17 Comment(0)
C
8

As a staunch proponent of, and proselytizer for, the benefits of surrogate keys, I none-the-less make an exception for all-key join tables such as your first example. One of the benefits of surrogate keys is that engines are generally optimized for joining on single integer fields, as the default and most common circumstance.

Your first proposal still obtains this benefit, but also has a 50% greater fan-put on each index level, reducing both the overall size and height of the indices on the join table. Although the performance benefits of this are likely negligible for anything smaller than a massive table it is best practice and comes at no cost.

When I might opt for the other design is if the relation were to accrue additional columns. At that point it is no longer strictly a join table.

Cherie answered 4/3, 2015 at 0:45 Comment(7)
My understanding was that if you have a compound key, there order in which columns appear in the compound key matters. When you are trying to lookup (think join) on the second column in the pair it is very inefficient, because the physical order (in case of clustered PK) or index (in case of non-clustered PK) won't allow for as efficient lookup. So you will end up creating index on the second column separately anyway. This feels really unnecessary asymmetric.Crownpiece
@zespri: it's fixable with a non-clustered index on the second key. If you use SQL Server or some other database that allows you to include non-key columns in an index, you could include the first key as well, and then you'd basically have two tables, one for looking up from either direction (counting the index as basically a table, which it sort of is).Pare
Should I create an index on the columns combined, as in Columns: AuthorID (ASC), BookID (ASC), or an individual index on each? Or would it be a question of which many-to-many direction would be queried more often?Schonfield
@user1447679: You require one of the indices to enforce the all-key property of the join table (perhaps created to implement a constraint) so order that by the (expected) most frequent lookup. Later on, add a non-clustered index the other way around as performance requires it.Cherie
@PieterGeerkens Using the table designer for example, in SQL Server, selecting both columns and choosing "Set Primary Key" by default creates a clustered index on the combination of the column. So in that case, you're saying I should add one more NON-clustered index on the column that would be joined/queried against more often, and perhaps add another one later on, for the other column, if performance requires it? Just making sure I understand.Schonfield
@user1447679: Two column can only be ordered in two distinct ways - no need for three indices. Every PK constraint will create an index as its enforcement mechanism.Cherie
@PieterGeerkens - So creating a compound key with a clustered index on both columns is all that's needed. I should also order the two columns keeping the one that would be joined/queried against first. This is correct?Schonfield
L
5

Using composite key would be my choice too. Here's why:

Less storage overhead

Let's say you would use a surrogate key. Since you'd probably gonna want to query all authors for a specific book and vica versa you'd need indexes starting with both BookId and AuthorId. For performance reasons you should include the other column in both indexes to prevent a clustered key lookup. You'd probably would want to make one of them a unique to make sure no duplicate BookId/AuthorId combinations are added to the table.

So as a net result:

  • The data is stored 3 times instead of 2 times
  • 2 unique constraints are to be validated instead of 1

Querying a junction table referencing table

Even if you'd add a table like Contributions (AuthorId, BookId, ...) referencing the junction table. Most queries won't require the junction table to be touched at all. E.g.: to find all contribution of a specific author would only involve the author and contributions tables.

Lamellibranch answered 4/3, 2015 at 12:53 Comment(0)
O
4

Depending on the amount of data in the junction table, a compound key might end up causing poor performance over an auto generated sequential primary key.

The primary key is the clustered index for the table, which means that it determines the order in which rows are stored on disc. If the primary key's values are not generated sequentially (e.g. it is a composite key comprised of foreign keys from tables where rows do not fall in the same order as the junction table's rows, or it is a GUID or other random key) then each time a row is added to the junction table a reshuffle of the junction table's rows will be necessary.

Orthopedic answered 18/2, 2020 at 14:42 Comment(0)
H
1

You probably should use the compound/composite key. This way you are fully relational - one author can write many books and one book can have multiple authors.

His answered 4/3, 2015 at 0:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.