Composite primary key
Asked Answered
S

8

20

I am working on the design of a database that will be used to store data that originates from a number of different sources. The instances I am storing are assigned unique IDs by the original sources. Each instance I store should contain information about the source it came from, along with the ID it was associated by this source.

As an example, consider the following table that illustrates the problem:

----------------------------------------------------------------
| source_id | id_on_source | data                              |
----------------------------------------------------------------
| 1         | 17600        | ...                               |
| 1         | 17601        | ...                               |
| 2         | 1            | ...                               |
| 3         | 1            | ...                               |
----------------------------------------------------------------

Note that while the id_on_source is unique for each source, it is possible for the same id_on_source to be found for different sources.

I have a decent understanding of relational databases, but am far from an expert or even an experienced user. The problem I face with this design is what I should use as primary key. The data seems to dictate the use of a composite primary key of (source_id, id_on_source). After a little googling I found some heated debates on the pros and cons of composite primary keys however, leaving me a little confused.

The table will have one-to-many relationship with other tables, and will thus be referred to in the foreign keys of other tables.

I am not tied to a specific RDBMS and I am not sure if it matters for the sake of the argument, but let's say that I prefer to work with SQLite and MySQL.

What are the pros and cons of using a composite foreign key in this case? Which would you prefer?

Sunburn answered 5/9, 2009 at 11:9 Comment(0)
U
32

I personally find composite primary keys to be painful. For every table that you wish to join to your "sources" table you will need to add both the source_id and id_on_source field.

I would create a standard auto-incrementing primary key on your sources table and add a unique index on source_id and id_on_source columns.

This then allows you to add just the id of the sources table as a foreign key on other tables.

Generally I have also found support for composite primary keys within many frameworks and tooling products to be "patchy" at best and non-existent in others

Unicycle answered 5/9, 2009 at 11:24 Comment(1)
Think about a composite PK to store Era and Timestamp (1, 1970~2106) (2, 2106~2242). Because INT8, INT16, INT32, INT64 are binary based and bit based in size then we don't have suitable INT size for the year 9999. INT is not enough and BIG INT is too large.Iggie
A
14

Composite keys are tough to manage and slow to join. Since you're building a summary table, use a surrogate key (i.e.-an autoincrement/identity column). Leave your natural key columns there.

This has a lot of other benefits, too. Primarily, if you merge with a company and they have one of the same sources, but reused keys, you're going to get into trouble if you aren't using a surrogate key.

This is the widely acknowledged best practice in data warehousing (a much larger undertaking than what you're doing, but still relevant), and for good reason. Surrogates provide data integrity and quick joins. You can get burned very quickly with natural keys, so stay away from them as an identifier, and only use them on the import process.

Aquileia answered 5/9, 2009 at 11:29 Comment(5)
What trouble are you talking about, exactly? If you have conflicts on a merge, wouldn't you probably want an error rather than duplicate data?Hokum
@JeffDavis Exactly, Surrogate keys invite redundancy AFAIK.Jase
Can you explain why composite keys are slow to join? I am trying to understand why I would not use composite keys actually. If I have a table that references another with a composite key (A,B), I don't actually have to join over the whole pk. I could also write ` ON (a.A = another.A)`, right? So what makes this slower?Whippersnapper
@displayname Composite Keys should be no slower to join than a Surrogate Key, provided the CK is correctly indexed - given PK Indexes do index all of their fields I find Eric's claim hard to believe - I would want to see profiled queries and benchmarks that quantitatively demonstrate CKs are slower than SKs.Obstreperous
@Obstreperous Thank you for clarification. I am not really deep into database systems but I would not have guessed that CKs are slow to join - why would they if they, like you said, are correctly indexed. I am also not so sure if they are so tough to manage. I can get ugly if you have very deep dependencies where in each layer you inherit the CK of the dependent table but I didn't have this problem yet.Whippersnapper
H
8

You have a business requirement that the combination of those two attributes are unique. So, you should have a UNIQUE constraint on those two attributes. Whether you call that UNIQUE constraint "primary" is really just a preference, it doesn't have much impact aside from documentation.

The only question is whether you then add an extra column and mark it UNIQUE. The only reason I can see to do that is performance, which is a legitimate reason.

Personally, I don't like the approach of turning every database into essentially a graph, where the generated columns are essentially pointers and you are just traversing from one to the next. I think that throws away all of the greatness of a relational system. If you step back and think about it, you're introducing a bunch of columns that have no meaning to your business, at all. You may be interested in my related blog post.

Hokum answered 5/9, 2009 at 17:10 Comment(0)
L
6

I believe that composite keys create a very natural and descriptive data model. My experience comes from Oracle and I don't think there is any technical issues when creating a composite PK. In fact anyone analysing the data dictionary would immediately understand something about the table. In your case it would be obvious that each source_id must have unique id_on_source.

The use of natural keys often creates a hot debate, but people whom I work with like natural keys from a good data model perspective.

Lulualaba answered 5/9, 2009 at 11:42 Comment(2)
yes, but joining from a child table to a primary key tends to get messy if you have to join on two, three, four conditions - and it bloats the primary key and thus all your indices. It might feel natural, but in reality, it's not a good ideaHendecagon
point taken. Usually you will find that primary entities will have DB generated unique key. e.g. Customer table with CustomerId. Its usually secondary related tables that have composite keys and most of them has no FK referencing them. e.g. if you store history of customer phone numbers then in a Customer_contact_history table the columns CustomerId, phone, changedate may be composite PK as these 3 things are naturally unique.Lulualaba
C
3

Pretty much the only time I use a composite primary key is when the high-order part of the key is the key to another table. For example, I might create an OrderLineItem table with a primary key of OrderId + LineNumber. As many accesses against the OrderLineItem table will be "order join orderlineitem using (orderid)" or some variation of that, this is often handy. It also makes it easy when looking at database dumps to figure out what line items are connected to what order.

As others have noted, composite keys are a pain in most other circumstances because your joins have to involve all the pieces. It's more to type which means more potential for mistakes, queries are slower, etc.

Two-part keys aren't bad; I do those fairly often. I'm reluctant to use a three-part key. More than three-parts, I'd say forget it.

In your example, I suspect there's little to be gained by using the composite key. Just invent a new sequence number and let the source and source key be ordinary attributes.

Charlatan answered 5/9, 2009 at 22:49 Comment(0)
P
2

Adding an extra ID column will leave you having to enforce TWO uniqueness constraints instead of one.

Using that extra ID column as the foreign key in other referencing tables, instead of the key that presents itself naturally, will cause you to have to do MORE joins, namely in all the cases where you need the original soruce_ID plus ID_on_source along with data from the referencing table.

Prevost answered 5/9, 2009 at 16:51 Comment(2)
Do you need to enforce uniqueness in this application? If you are getting the data from these other systems, presumably it's their problem to enforce uniqueness. It gets back to what you need to accomplish.Charlatan
As to the extra join: I'd keep the source and id_on_source in the same table, whether it's the primary key or not. I don't see any reason here to have a second look-up table to do translations. Keep it all together.Charlatan
D
2

I ran into problems using a lot of composite keys and so I wouldn't recommend it (more below), I've also found there to be benefits in an independent/surrogate key (rather than natural) when trying to roll back user mistakes. The problem was that via a set of relations, one table joined two tables where for each row part of the composite was the same (this was appropriate in 3rd normal form - a comparison between two parts of a parent). I de-duplicated that part of the composite relationship in the join table (so instead of parent1ID, other1ID, parent2ID, other2ID there was parentID, other1ID, other2ID) but now the relation couldn't update changes to the primary key, because it tried to do it twice via each route and failed in the middle.

Dignadignified answered 16/11, 2011 at 10:5 Comment(0)
T
1

Some people recommend you use a Globally Unique ID (GUID): merge replication and transactional replication with updating subscriptions use uniqueidentifier columns to guarantee that rows are uniquely identified across multiple copies of the table. If the value if globally unique when it's created, then you don't need to add the source_id to make it unique.


Although a uniqueid is a good primary key, I agree that it's usually better to use a different, natural (not necessarily unique) key as your clustered index. For example if a uniqueid is the PK which identifies employees, you might want to clustered index to be the department (if your select statements usually retrieve all employees within a given department). If you do want to use a unqiqueid as the clustered index, see the NEWSEQUENTIALID() function: this creates sequential uniqueid values, which (being sequential) have better clustering performance.

Twinflower answered 5/9, 2009 at 11:35 Comment(3)
just be careful (in SQL Server) NOT to make your GUID primary key the clustered key of the table (which it is, by default) - see Kim Tripp's excellent article on why not: sqlskills.com/BLOGS/KIMBERLY/post/…Hendecagon
Regarding the GUID: If the requirement is just to have a unique id for each record, yes, this will work. But if you need to know what the source was, then you either have to post the source id into the record anyway, or you have to have a lookup table somewhere else (yuck), or you have to search all the possible sources looking for that GUID (double yuck). If you have to keep the source id anyway, a GUID adds no value.Charlatan
... no value, except that you don't need a compound primary key (the source id doesn't need to be part of the primary key).Twinflower

© 2022 - 2024 — McMap. All rights reserved.