SQL - many-to-many table primary key

Asked 3/2, 2010 at 7:14 Answered 9/7, 2018 at 18:28

156

This question comes up after reading a comment in this question:

When you create a many-to-many table, should you create a composite primary key on the two foreign key columns, or create an auto-increment surrogate "ID" primary key and just put indexes on your two FK columns (and maybe a unique constraint)? What are the performance implications for inserting new records and re-indexing in each case?

Basically, this:

PartDevice
----------
PartID (PK/FK)
DeviceID (PK/FK)

vs. this:

PartDevice
----------
ID (PK/auto-increment)
PartID (FK)
DeviceID (FK)

The commenter says:

making the two IDs the PK means the table is physically sorted on the disk in that order. So if we insert (Part1/Device1), (Part1/Device2), (Part2/Device3), then (Part 1/Device3) the database will have to break the table apart and insert the last one between entries 2 and 3. For many records, this becomes very problematic as it involves shuffling hundreds, thousands, or millions of records every time one is added. By contrast, an autoincrementing PK allows the new records to be tacked on to the end.

The reason I'm asking is because I've always been inclined to do the composite primary key with no surrogate auto-increment column, but I'm not sure if the surrogate key is actually more performant.
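For concreteness, here is a minimal sketch of the two designs as DDL. It assumes SQLite (driven from Python's standard `sqlite3` module) purely so that it runs anywhere; the table and index names are illustrative, and storage behaviour differs between engines, so it says nothing about performance in SQL Server, MySQL, or others.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part   (part_id   INTEGER PRIMARY KEY);
    CREATE TABLE device (device_id INTEGER PRIMARY KEY);

    -- Design 1: composite primary key on the two FK columns.
    CREATE TABLE part_device_composite (
        part_id   INTEGER NOT NULL REFERENCES part(part_id),
        device_id INTEGER NOT NULL REFERENCES device(device_id),
        PRIMARY KEY (part_id, device_id)
    );

    -- Design 2: surrogate key, plus a unique constraint and an FK index.
    CREATE TABLE part_device_surrogate (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        part_id   INTEGER NOT NULL REFERENCES part(part_id),
        device_id INTEGER NOT NULL REFERENCES device(device_id),
        UNIQUE (part_id, device_id)
    );
    CREATE INDEX idx_surrogate_device ON part_device_surrogate (device_id);
""")

def index_names(table):
    # PRAGMA index_list returns one row per index; column 1 is the index name.
    return [row[1] for row in conn.execute(f"PRAGMA index_list({table})")]

composite_indexes = index_names("part_device_composite")
surrogate_indexes = index_names("part_device_surrogate")
print(composite_indexes)
print(surrogate_indexes)
```

In this sketch the surrogate design carries one more index than the composite design, which is part of the write-cost trade-off the question asks about.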

Ran answered 3/2, 2010 at 7:14 Comment(4)

Here's a similar question posted on SO: #344568 – Dawes 3/2, 2010 at 10:3

(Tried to add this to my previous comment but can't) Depending on the number of inserts you can also periodically rebuild your index to ensure it returns results quickly. In SQL Server you can also tweak the FILLFACTOR of the index to provide enough space for inserts before it has to move data around. – Dawes 3/2, 2010 at 10:10

Doesn't the answer to this depend on which DBMS is used? I suspect MySQL will behave one way in this case, SQL Server in a slightly different way, etc. – Micamicaela 11/6, 2018 at 13:57

Caveat: Without a specific database tag, much of what is said here is suspect. Different engines work differently! – Unhand 2/6, 2020 at 3:7

115

With a simple two-column many-to-many mapping, I see no real advantage to having a surrogate key. Having a primary key on (col1,col2) is guaranteed unique (assuming your col1 and col2 values in the referenced tables are unique) and a separate index on (col2,col1) will catch those cases where the opposite order would execute faster. The surrogate is a waste of space.

You won't need indexes on the individual columns since the table should only ever be used to join the two referenced tables together.
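As a rough illustration of why the reverse-order index helps, the sketch below asks SQLite for its query plan before and after adding an index on (device_id, part_id). SQLite and the names `part_device` and `idx_device_part` are assumptions made for the example; other engines report plans differently, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE part_device (
        part_id   INTEGER NOT NULL,
        device_id INTEGER NOT NULL,
        PRIMARY KEY (part_id, device_id)
    )
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

by_part = "SELECT device_id FROM part_device WHERE part_id = 1"
by_device = "SELECT part_id FROM part_device WHERE device_id = 1"

# The PK index leads with part_id, so this lookup can seek.
plan_by_part = plan(by_part)
# No index leads with device_id yet, so this lookup has to scan.
plan_by_device_before = plan(by_device)

conn.execute("CREATE INDEX idx_device_part ON part_device (device_id, part_id)")
plan_by_device_after = plan(by_device)

print(plan_by_part)
print(plan_by_device_before)
print(plan_by_device_after)
```

With only the primary key, the lookup by `device_id` is reported as a scan; once the reversed index exists it becomes a search on that index.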

That comment you refer to in the question is not worth the electrons it uses, in my opinion. It sounds like the author thinks the table is stored in an array rather than an extremely high performance balanced multi-way tree structure.

For a start, it's never necessary to store or get at the table sorted, just the index. And the index won't be stored sequentially, it'll be stored in an efficient manner to be able to be retrieved quickly.

In addition, the vast majority of database tables are read far more often than written. That makes anything you do on the select side far more relevant than anything on the insert side.

Polyzoic answered 3/2, 2010 at 7:20 Comment(7)

Last point isn't a good generalization : "vast majority of database tables are read far more often than written". I find many examples of associative tables that need to be written to very often e.g. a table linking customer to order. – Madel 26/8, 2013 at 19:13

@buffer, I'll stand by that comment (technically, it's a generalisation only if I say "all tables"; "vast majority" is based on experience). Let's also think about your example: an order is created once (it may be updated occasionally, but that's unlikely to change key/index info, more to hit things like order status). However, those updates and the selects you'll need to do to print out invoices or generate management reports are going to outweigh the original insert. – Polyzoic 26/8, 2013 at 21:19

Think Amazon - Thousands of orders created every hour. – Madel 27/8, 2013 at 7:4

@buffer, yes, but again, each of those orders will almost certainly be queried many times to do (for example) packaging, billing, status updates, business analytics and so on. The absolute number of creates is less important than the ratio between creates and reads. – Polyzoic 27/8, 2013 at 8:48

My point is, inserts will matter if they're being done thousands of times per hour. You can't simply ignore them just because the ratio of inserts to selects is < 1. In this case, a customer cares about how much time it takes to place an order. – Madel 27/8, 2013 at 9:7

@Polyzoic so in a two-column many-to-many mapping, the way to go is a composite PK on both columns AND a unique key on both columns too? If that's the case, shouldn't the unique key on both columns already handle the duplicates? – Jeth 18/6, 2016 at 6:20

If rows of the many-to-many table are not referenced from other tables, pairing a composite unique index with a separate PK is a useless waste of disk and performance (due to double lookups), because there is no difference between updating or inserting into the two columns of a composite unique index and doing the same for a composite clustered PK. – Howland 29/8, 2016 at 16:24

No surrogate key is needed for link tables.

One PK on (col1, col2) and another unique index on (col2, col1) is all you need

Unless you use an ORM that can't cope and dictates your DB design for you...

Edit: I answered the same here: SQL: Do you need an auto-incremental primary key for Many-Many tables?
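A minimal sketch of that layout, assuming SQLite via Python's `sqlite3` module and a hypothetical table named `link`. It only shows that the composite PK by itself rejects a duplicate pair while still allowing the reversed pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link (
        col1 INTEGER NOT NULL,
        col2 INTEGER NOT NULL,
        PRIMARY KEY (col1, col2)
    );
    -- Reverse-order index for lookups that start from col2.
    CREATE UNIQUE INDEX idx_link_col2_col1 ON link (col2, col1);
""")

conn.execute("INSERT INTO link (col1, col2) VALUES (1, 2)")
# The reversed pair is a different row, so it is accepted.
conn.execute("INSERT INTO link (col1, col2) VALUES (2, 1)")

try:
    # Same pair again: violates the composite primary key.
    conn.execute("INSERT INTO link (col1, col2) VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(duplicate_rejected, row_count)
```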

Sudarium answered 3/2, 2010 at 7:18 Comment(6)

You might be OK with a dups index on col2 instead of a unique index on (col2, col1). The advantage of the two-column index is that it allows index-only scans on either col2 alone or on both col1 and col2 (though the other index, on (col1, col2) also handles the 'both' case). The downside is the extra storage needed for the extra column. This is usually not significant, so the advice is far from awful. Nevertheless, if col1 and col2 are big or of very different sizes, you can save yourself some space without hurting performance by electing to have the second index on just the shorter column. – Ruck 3/2, 2010 at 7:39

@Sudarium : The second index on (col2, col1) doesn't need to be unique, right? – Madel 26/8, 2013 at 20:13

putting a unique index on (col1, col2) after it already is a PK is wholly redundant – Deliquescence 4/11, 2014 at 14:36

@mmcrae: where are we doing that? – Sudarium 5/11, 2014 at 14:27

In your answer: "One PK on (col1, col2) and another unique index on (col2, col1) is all you need" – Deliquescence 5/11, 2014 at 14:46

@mmcrae: Your comment is "putting a unique index on (col1, col2)..". Column order in an index matters. (col2, col1) is not (col1, col2). The PK of (col1, col2) may not be suitable for all queries and may generate scans, so having the reverse of it improves performance because it allows seeks where col2 is the better leading column. For example, FK validation when the table with col2 has a delete: the child table must be checked – Sudarium 7/11, 2014 at 8:26

An incremental primary key could be needed if the table is referenced. There might be details in the many-to-many table that need to be pulled up from another table using the incremental primary key.

For example:

PartDevice
----------
ID (PK/auto-increment)
PartID (FK)
DeviceID (FK)
Other Details

It's easy to pull the 'Other Details' using PartDevice.ID as the FK. Thus the use of incremental primary key is needed.
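A rough sketch of that arrangement, assuming SQLite via Python's `sqlite3` module. The referencing table `part_device_note` and its columns are hypothetical, introduced only to show another table pointing at `part_device.id`; the unique constraint on the pair is kept so the surrogate key does not open the door to duplicate links.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE part_device (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        part_id   INTEGER NOT NULL,
        device_id INTEGER NOT NULL,
        UNIQUE (part_id, device_id)
    );
    -- Hypothetical table that references one row of the link table.
    CREATE TABLE part_device_note (
        note_id        INTEGER PRIMARY KEY,
        part_device_id INTEGER NOT NULL REFERENCES part_device(id),
        note           TEXT NOT NULL
    );
""")

cur = conn.execute("INSERT INTO part_device (part_id, device_id) VALUES (1, 10)")
link_id = cur.lastrowid  # surrogate key generated for the new link row
conn.execute(
    "INSERT INTO part_device_note (part_device_id, note) VALUES (?, ?)",
    (link_id, "fits rev B"),
)

note = conn.execute("""
    SELECT pd.part_id, pd.device_id, n.note
    FROM part_device_note AS n
    JOIN part_device AS pd ON pd.id = n.part_device_id
""").fetchone()
print(note)
```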

Ruisdael answered 26/11, 2011 at 12:41 Comment(2)

Thanks! I came to the answer as I was looking for almost the same scenario you described. But you drifted away from your first sentence by adding "Other details". What if I had a many to many mapping table, which I need to reference to from another table? Meaning, the many to many mapping table has not stored any other information... Would the additional ID column make sense anyway? If not, how to reference to one record of the mapping table instead? – Perrins 11/1, 2019 at 13:16

There are two options here. You can use the compound key as a foreign key from your referencing table (this adds an extra column to your new table), or you can add an id column to the mapping table and put a unique constraint on the original compound primary key, while the new id column becomes the primary key. – Undercroft 17/9, 2019 at 3:10
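The first option, a compound foreign key, might look like the sketch below. It assumes SQLite via Python's `sqlite3` module, and the referencing table `part_device_note` is hypothetical. The referencing table repeats both key columns and the engine checks the pair as a unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE part_device (
        part_id   INTEGER NOT NULL,
        device_id INTEGER NOT NULL,
        PRIMARY KEY (part_id, device_id)
    );
    -- Hypothetical referencing table carrying both key columns.
    CREATE TABLE part_device_note (
        note_id   INTEGER PRIMARY KEY,
        part_id   INTEGER NOT NULL,
        device_id INTEGER NOT NULL,
        note      TEXT NOT NULL,
        FOREIGN KEY (part_id, device_id)
            REFERENCES part_device (part_id, device_id)
    );
""")

conn.execute("INSERT INTO part_device (part_id, device_id) VALUES (1, 10)")
conn.execute(
    "INSERT INTO part_device_note (part_id, device_id, note) VALUES (1, 10, 'ok')"
)

try:
    # (1, 99) is not a row in part_device, so the compound FK rejects it.
    conn.execute(
        "INSERT INTO part_device_note (part_id, device_id, note) VALUES (1, 99, 'orphan')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```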

The shortest and most direct way I can answer your question is to say that there will be a performance impact if the two tables you are linking don't have sequential primary keys. As you stated/quoted, the index for the link table will either become fragmented, or the DBMS will work harder to insert records if the link table does not have its own sequential primary key. This is the reason most people put a sequentially incrementing primary key on link tables.

Hierodule answered 3/2, 2010 at 7:24 Comment(0)

So it seems like if the ONLY job is to link the two tables, the best PK would be the dual-column PK.

But if the table serves other purposes, then add another index column as the PK, keep the foreign keys, and add a second unique index.

A unique index or PK is the best way to make sure there are no duplicates. A PK also lets tools like Microsoft SQL Server Management Studio do some of the work (such as creating views) for you.

Jaymie answered 9/7, 2018 at 18:28 Comment(0)
