Should we use sequences or identities for our primary keys?
Asked Answered
F

5

8

We are creating a new database with 20+ tables, and our database supports:

  • sequences.
  • identity columns (generated always as identity/serial).

So, the question is: should we use sequences or identities? Which one is better? The team seems to be divided on this one, so I wanted to hear pros and cons, to help decide.

Adding database details:

  • We are creating the new database on IBM DB2, but we need to make sure it will be compatible with future plans of migration to PostgreSQL.
Fuzz answered 6/4, 2018 at 15:56 Comment(12)
Don't forget to implement your logical Primary Keys, too.Sheikdom
@Sheikdom What is a logical primary key?Fuzz
The key which comes from your data [model]. E.g. an ISBN is an already existing key or you got an associative table and (col1,col2) might be defined as key. Now you add a sequence and you got two unique constraints to implement.Sheikdom
I think you should add the database backend you use as a tag. That would prevent people from answering based on what some other database does.Zeiler
@dnoeth, I find that natural keys are usually the worst choice because most of them are subject to change. However, yes you need a unique index for them.Zeiler
In Postgres identity columns as well as serial/bigserial pseudotypes are based on sequences. So, if you use one of these, you automatically use sequences.Lucius
Postgres added identity columns for compatibility with the SQL standard (and hence other RDBMSs). There's a good article about them here. If you're planning a migration from DB2, this sounds like the way to go.Sanskritic
Natural keys, but only when they really are a natural key. things like 'VIN' 'EAN'' and 'username', make good natural keys. things with regional scope like SSN not so much.Wigley
@Zeiler - "most of them are subject to change" - if your database is exposed to the risk of such a change then you've picked the wrong natural key. An ISBN rarely 'changes' but in those rare cases not every physical copy get a new barcode i.e. the 'wrong' ISBN still uniquely and correctly identifies my copy in my collection. And ISBN does not always solve book identity problems e.g. what's the ISBN (or agreed title even!) for the book widely known as 'Alice'? The point is that identity requires careful design and 'autonumber' is in most cases applied in lieu of careful design.Faradmeter
@onedaywhen, you don't pick the natural keys, the natural keys are inherent in the data. Few are suitable to be a primary key.Zeiler
@HLGEM: I don't understand your comments about 'inherent' and 'don't pick'. Maybe we are talking cross purposes. An example may help. Chemical element hydrogen can be uniquely and unambiguously identified by either atomic number 1 or the letter H (SNOMED CT ID 2141009, no doubt there are many others). Both meet the properties of a good key (small, stable, familiar to users, trusted source, etc). Most DBMSs make you pick just one primary key. There is nothing inherent in hydrogen that suggests '1' or 'H'.Faradmeter
You pick the primary key not the natural key(s). The data you intend to store in the database will have the natural keys that make sense for what you need to do. What you have to do is be sure that natural keys are created with a unique index whether they are also chosen as a primary key or not.Zeiler
C
5

Your question is about using sequences versus identity ("generated always as identity" columns, presumably). In Postgres, these would be declared as serial. These would always be some sort of number in a single column.

From the database performance perspective, there is not much difference between the two. One important difference is that some databases will cache identity columns, which can speed inserts but cause gaps. The rules for caching sequences might be different. In a high transaction environment, inadequate caching can be a performance bottleneck. Sharing a single sequence across multiple tables makes this problem worse.

There is a bigger difference from a data management perspective. A sequence requires managing two objects (the table and the sequence). An identity or serial column is built into the table.

For a single table, I have only considered using sequences in databases that do not support built-in auto-increment/serial/identity columns (ahem, "Oracle"). Otherwise, I would use the mechanism designed to work with tables.

I do want to point out that using an auto-incremented surrogate key has other benefits. This should also be the key used for clustering the data, if such a concept exists in the database. New inserts are then always at the "end" (although if you are deleting data, then pages might only be partially used). The primary key should also be the only key used for foreign key references, even if other columns -- in isolation or together -- are unique and candidate primary keys.

Cyathus answered 6/4, 2018 at 17:43 Comment(2)
So performance wise, identities may have an edge, thanks. Just found that Oracle 12c does have identities (it took only 37 years!).Fuzz
"In Postgres, these would be declared as serial" - true in 9.x, but PG10 supports identity columns as well.Sanskritic
P
2

The best answer is to point you back to your situation.

First, many people prefer sequences, as they are easy to generate and provide a single data type to navigate your joins. Additionally many shops require single column primary keys to assist further in code complexity.

Let's talk about the downsides:

Sequences: When using b-tree indexes, sequences are generally inserted in ascending order, which can result in an "unbalanced tree" and cause less than perfect performance (on b-tree indexes) over time. Sometimes, people instead generate hashes or GUIDs to result in a more balanced tree.

Sequences can result in "hard to read" code when using "lookup tables", especially when values are hard coded in your database. Example: "where status_seq=1" is harder to read than "where status_id='ACTIVE'".

Downsides of using IDs: Mixed data types can cause confusion. Sometimes they're numeric, sometimes they're varchar or char. Many ORMs can confuse those and leave off leading zeros causing errors in your results. IE 01234 != 1234, but your ORM may return 1234 instead of 01234.

Many people store ID's in human readable form, like "VALID", or state abbreviations. This can cause headaches in the long run, so even if you do use IDs on a table, you may want to steer clear from ever showing those IDs directly to your customer.

ID fields are much more likely to "need to change" in the future, than a sequence. Example: Let's say you have a country code table and a revolution takes place and a country code changes. Do you really want to go through the main table and all the foreign keys that reference it, putting in the new country code-- or living with the old country code, cause that's your choice. If you use a sequence in that case, you simply update other non-key columns in the base table and you're good to go.

Benefits:

Benefits of Sequences: Sequences are by nature automatically generated. IDs aren't always. When adding records, do you really want a programmer or user naming an ID that cannot be easily changed? When you use sequences, there's rarely a need to renumber things, and the underlying human-readable data can be easily changed if a mistake is made.

As mentioned above, they're always a numeric datatype, and if used properly can assist in "navigating" you app (IE, usually only having to "pass around" one number to navigate your table structure)

When using an communicating between the DB and your programming language, you can count on being able to convert integers to integers without any weird data conversion issues.

IDs: Primary benefit is code that's easier to read which we already explained above.

In summary, I think it's on a case, by case basis, depending on table and column usage. If you're going to use IDs, avoid the temptation to show the value to the user. If the table's not going to change and simply holds flags, or "enum" type data, then IDs can certainly help with code readability. Otherwise, sequences are often the better choice for maintainability of your data.

Some people choose GUIDs or IDs to help with index performance, but personally, if there's any loss in code readability or the code gets more complex, I'd spend some money on better hardware before I'd write more complex code-- as the benefit is miniscule.

Source: Oracle certified DBA (training on this exact subject), and 20+ years of experience working with developers and enterprise databases.

Principal answered 6/4, 2018 at 16:58 Comment(2)
As identity columns I was considering database-generated numeric ones, not character ones. I'm sorry if this caused confusion.Fuzz
@TheImpaler -- So, what's the difference between a SEQ an an ID in that case?Principal
S
0

I'm a fan of sequences. I like it if all the IDs are the same type, and all the IDs come from the same sequence. It's not necessary, just something that lets you tease out the order in which things occur...which is often not so much a technical requirement, but a debugging aid. I tend to favor bigint to be my key type, so I'm pretty much guaranteed to never run out of IDs. If you're using int keys (or smaller), you'd want to use one sequence per table.

Having said that, there are issues to watch out for with sequences. For example, it's possible to "burn" sequences without actually putting them in data. Again, this may or may not be a problem. Generally, I haven't had to care.

Sequences are typically implemented by making a default constraint on the ID column of a table. This means a couple of things to watch out for. It's possible that a value for the column is actually provided on the insert...which doesn't 'bump' your sequence, and may collide with future inserts that do not provide a value. This to me, is the most significant concern. If all your IDs are provided by defaulting, this is a non-issue.

Procedures (and remote clients) can reserve or "burn" sequences. This is extremely convenient...lets your procedure code know in advance what the IDs are without having to commit them to data. You can always do something like:

insert someTempTable( Id, Name )
select
  next value for dbo.MySequence,
  Name
from
  dbo.SomeTable

...which burns sequence values, but the nice thing is, when I go to insert my rows from my work table someTempTable into the real table, I can rest assured that the IDs aren't going to conflict. This is simpler than with identity column-based IDs. I can build a whole series of related data in temp, and then move it all into persistent storage set-wise. This is usually a lot more efficient.

Stacistacia answered 6/4, 2018 at 16:33 Comment(4)
Never thought of using a SINGLE sequence (bigint) for all tables. No id collision anymore!Fuzz
I should say - I do make exceptions. Little tables...for enum-like lookups that are probably always id/name...will probably be smallint with no default, and populated in a post-build script with "hard-coded" Ids (which match the IDs in the equivalent enums in my external applications)Stacistacia
If you are using a single sequence to generate all your primary keys, and want to maintain sequence, you can't cache the sequence, so that could be a significant bottleneck in performance. If you cache the sequence, you are going to be generating keys out of sequence as each process will have it's own cache of keys to pull from.Pterous
For me, the fact that the sequence values may have gaps has never been an issue...since a key (for me) is only a unique identifier...and really nothing more. As I mentioned, ID order allows for some diagnostics, but not something I'd ever depend on. If gap-less ID generation is important, than identity is probably a better choice. I like sequences because they can simplify stored procedure logic.Stacistacia
Z
0

I've not used sequences, but I can discuss identity fields.

First they work quite nicely in every case where I have used them for the last 18 years of using SQL Server. This is most likely true on other databases as well as this is a critical feature for the databases that use them. We have never had any problems surrounding the use of identities. You might want to define the identity as big int when you set it up if you are expecting to have a very large database.

If you don't set up an identity at the time of table creation, it is a pain to set it up later in SQl Server, check your databases for details there. However, if you are using autogenerated keys exclusively as the PKs, you would do this at the time of table creation.

A critical thing when using identities (or sequences or GUIDs for that matter) is that in addition to the auto generated value, you need to create a unique index for the natural key(s) in your table if you have them. This will prevent data integrity problems.

Other problems can be if you have an issue with numbers being skipped on rollback. Since these are meant to be placeholders, they should not have meaning, so it may not be a problem, but I have seen cases where people needed this functionality for business reasons not technical reasons. Test both with rollbacks to see if you have gaps if you need them to not have gaps. If both have gaps, then you will need to roll your own system watching out for race conditions.

Since you say you are creating a database in DB2 to migrate to Postgres, I would set up a test with a couple of tables with identities in db2 and a couple of tables with sequences. Insert a large amount of fake data into them. Then I would test how difficult it is to port them to the Postgres database and start adding records. This may be a key piece of data in which method is better in your particular case.

You might also consider doing testing concerning performance by inserting a very large number of records to two test tables that are alike except in the way that they assign the Id. It may be that performance is acceptable both ways, it may be that one is faster than the other. the following link is for SQL Server, but the test methodology is probably something you could find useful in making your determinations. http://dba-presents.com/index.php/sql-server/25-identity-vs-sequence-performance-test

It is critical to do your own determinations of things like performance if that is a critical issue because results can be affected by your own particular set-up.

If you want a meaningful ID based on some text values and an incrementing number(such as CA1,CA2, CA3, TX1, TX2, TX3), then an identity will not work but a sequence could I think (see this article:PostgreSQL sequence based on another column). So sequences can give you more flexibility but if you don't need it, then why bother?

Probably I would also consider that it would be most confusing for maintenance (and in your case conversion) to sometimes use one and sometimes the other. Consistency in how you do things may be the key. If you have one case where sequences give you a flexibility that you must have that identities do not. I would use sequences throughout just to avoid unneeded complexities of knowing which table used what when you do the conversion.

Zeiler answered 6/4, 2018 at 18:0 Comment(1)
Note that Postgres also supports ANSI standard identity columns.Janiculum
S
0

Db2 IDENTITY columns are backed by sequences (which support caching and out of order generation for higher performance) -- the difference is purely syntactic. With an identity column:

create table t1 (
  id integer not null generated always as identity cache 100,
  foobar varchar(111)
)

you do not provide the value for that column, it is generated and inserted automatically:

insert into blah (foobar) values ('something')

If the column is not defined as IDENTITY you must explicitly create a sequence and generate values when inserting rows

create table t2 (
  id integer not null,
  foobar varchar(111)
)

create sequence myseq cache 100

insert into t2 values (next value for myseq, 'something else')

Similarly, when you need to redefine identity properties, you do that via the ALTER TABLE statement; for sequences you use ALTER SEQUENCE.

Only one column in a table can be defined as IDENTITY; if you need more than one such column you will have to use sequences for them.

Special treatment is necessary when mass-loading data using the LOAD or IMPORT utilities into tables with identity columns -- you will need to either override or ignore identity values that may be present.

Sarmiento answered 7/4, 2018 at 18:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.