Should primary key be auto_increment?
Asked Answered
N

4

6

It is better to use primary key when design tables.

But when designing a primary key, is it needed to set auto_increment?

What's the benefit?

I heard that can keep b-trees stable, but why?

If table has a unique column, which is better: set the unique column as primary key or add a new column id as auto_increment primary key?

Nationalism answered 20/4, 2014 at 8:34 Comment(2)
The benefit is that you don't have to provide a value, and the database engine will (usually) ensure unique values....Bacchius
If table has a unique column, which's better that set the unique column as primary key or add a new column 'id' as auto_increment primary key?Nationalism
U
11

I want to know when design a primary key, it is needed to setting auto_increment?

No, it's not strictly necessary. There are cases when a natural key is fine.

If done, what's the benefit?

Advantages of using an auto-increment surrogate key:

  • Surrogate keys never need to change, even if all other columns in your table are possible to change.
  • It's easier for the RDBMS to ensure uniqueness of an auto-increment key without locking and without race conditions, when you have multiple users inserting concurrently.
  • Using an integer is the most compact data type you can use for a primary key, so it results in a smaller index than using a long string, for example.
  • Efficiency of inserting into B-tree indexes (see below).
  • It's a little easier and tidier to reference a row with a single column than multiple columns, when the only other candidate key consisted of several columns.

Advantages of using a natural key:

  • The column has some meaning for the entity, for example a phone number. You don't need to store an extra column for the surrogate key.
  • Other tables using foreign keys to reference a natural primary key get a meaningful value, so they can avoid a join. For example, a table of shoes referencing colors would need to do a join if you wanted to get the color name. But if you use the color name as the primary key of colors, then that value would already be part of the shoes table.

Other cases when a surrogate auto-increment key is not needed:

  • You already have a combination of other columns (whether they are surrogate keys or natural keys) that provides a candidate key for the table. A good example is found in many-to-many tables. If a table maps movies to actors, even if both movies and actors are referenced by primary keys, then you already have a candidate key over those two columns, and you don't need yet another auto-increment column.

I listen, that can keep b-tree's stable, but i don't know why?

Inserting a value into an arbitrary place in the middle of a B-tree may cause a costly restructuring of the index.

There's an animated example here: http://www.bluerwhite.org/btree/

Look at the example "Inserting Key 33 into a B-Tree (w/ Split)" where it shows the steps of inserting a value into a B-tree node that overfills it, and what the B-tree does in response.

Now imagine that the example illustration only shows the bottom part of a B-tree that is much deeper (as would be in the case of an index B-tree has millions of entries), and filling the parent node can itself be an overflow, and force the splitting operation to continue up the the higher level in the tree. This can continue all the way to the very top of the tree if all the ancestor nodes to the top of the tree were already filled.

As the nodes split and have to be restructured, they may require more space, but they're stored on some page of the database file where there's no spare space. So the storage engine has to relocate parts of the index to another part of the file, and potentially re-write a lot of pages of index just for a single INSERT.

Auto-increment values are naturally always inserted at the very rightmost edge of the B-tree. As @ BrankoDimitrijevic points out in a comment below, this does not make it less likely that they'll cause such laborious node-splitting and restructuring to the index. But the B-tree implementation code can optimize for this case in other ways, and some do.

If table has a unique column, which's better that set the unique column as primary key or add a new column 'id' as auto_increment primary key?

If the unique column is also non-nullable, then you can use it as a primary key. Primary keys require that all of their columns are non-nullable.

Urdar answered 20/4, 2014 at 16:54 Comment(1)
The fact that maximal values are stored in the right-most leaf of the B-tree does not by itself mean that there will be less node splitting. It does, however, provide an opportunity for the DBMS to optimize for that specific case and leave the new right-most node (after the split) empty or nearly empty (in anticipation all/most new values will be inserted there), delaying the next split. I know Oracle has that (see page 31 of this presentation) - I wouldn't be surprised if other DBMSes implemented similar optimization.Exuviae
E
6

I know it is better to use primary key when design tables.

In fact, a key-less table is a multiset (since it allows duplicates), and therefore not strictly a relation (which is a set), and therefore your database would not really be "relational".

Please note that "primary" (PRIMARY KEY) and "alternate" (UNIQUE constraint) keys are logically equivalent.

But I want to know when design a primary key, it is needed to setting auto_increment?

You are actually asking multiple questions:

  1. Should I create a key?
  2. If yes, should I create a surrogate key?
  3. If yes, should it be integer?
  4. If yes, should I make it auto-incremented?

The answer to (1) is "almost always". There are some very rare cases when data is not "important" and you might skip it for performance reasons, but that's exceedingly rare.

The answer to (2) is "it depends" - main pros and cons can be found here.

The answer to (3) depends on whether you need to generate keys independently from the database (e.g. while disconnected, or while connected to a different database). If yes, you could use GUIDs (which obviously cannot be auto-incremented, but can be uniquely generated in isolation). If no, then you can just use integers - they are more compact and generally faster.

And finally, if you reached (4), then you'll almost certainly want to make it auto-incremented, for the reasons discussed below.

If done, what's the benefit?

  • The main benefit of making an integer surrogate key auto-incremented is that multiple concurrent clients will never receive the same generated value. If you just try to SELECT MAX(ID) + 1 FROM ..., there is no guarantee some other client will not try to do the same thing at the same time, and end-up with the same result (subsequently causing a key violation).
  • The other benefit is that the DBMS will use a highly optimized code path for generating new unique values.
  • The drawback is that auto-increment mechanism is typically not transaction-aware: if you generate a new ID value and then ROLLBACK the transaction, that value will not be generated again. That being said, surrogate keys don't have any meaning (if they had, they would not be surrogate), so such "holes" are inconsequential.

If table has a unique column, which's better that set the unique column as primary key or add a new column 'id' as auto_increment primary key?

If the attribute is intrinsically unique at the "logical level", then the corresponding table column must be made unique (through PRIMARY KEY or UNIQUE constraint), whether you later decide to add a surrogate key or not.

Exuviae answered 20/4, 2014 at 14:58 Comment(3)
+1 great answer, except that UNIQUE constraints are equivalent to PRIMARY constraint only if they're defined on non-nullable columns. If a UNIQUE constraint is on nullable columns, it permit nulls, allows more than one row when the value is null, and makes the table into a multiset.Urdar
@BillKarwin Thanks for spotting that. On a sidenote: not all DBMSes treat NULLs under UNIQUE constraints equally - some treat them as "unknown" (Oracle, MySQL, PostgreSQL) and some as "empty" (MS SQL Server), allowing and disallowing duplicates, respectively. Also there are differences in how mixed NULLs and non-NULLs are treated in composite UNIQUE constraints (whether the uniqueness of the non-NULL portion is enforced or not).Exuviae
Yep, I've observed that as well. In any case, PRIMARY KEY must still be non-null per the standard.Urdar
J
1

To have an auto-increment PK makes it easy to create a key that never needs to change, which in turn makes it easy to reference in other tables.

If your data is such that you have natural columns that are unique and can never change you can use them just as well. Mind you most things that "will never change" tends to do so anyway given enough time, like someones social security number...

For simplicity I always use auto-increment (identity) columns for PK.

Jacey answered 20/4, 2014 at 8:44 Comment(2)
You can reference changing PK values from other tables, if the other tables use FOREIGN KEY... ON UPDATE CASCADE.Urdar
If we're now pulling on software specific features, perhaps this issue is trivial, or suggestions should be labeled so for that reason. Oracle for example does not support the feature of ON UPDATE CASCADE. Tom Kyte Regarding On Update Cascade: "Personally -- I've never found a need or use for update cascade. I'm opposed to it. If your design requires it -- change your design now if you can."Mosa
M
-1

Support for Using Surrogate Keys in Table Definitions

Thanks to @Branko Dimitrijevic for opening the topic of relational database primary keys (PK's) by describing the role of SURROGATE KEYS and getting to the center of the discussion. Surrogate keys are by definition devoid of any intrinsic meaning aside from their uniqueness between their values in each record of their table.

Thanks also to @Mattias Åslund for your additional wisdom:

Mind you most things that "will never change" tends to do so anyway given enough time, like someones social security number...

I add that even if the value chosen as "unchangeable" really does not change, it is also just as likely for the rules of the supported business or organization itself to drift and change over time in ways that may affect the core assumptions of given design.

A useful discussion on the integration of demographic and biometric key values for tracking individuals can be found in this section on Choosing an Appropriate Key for New Databases put out by the Computer Professionals for Social Responsibility.

Cases of Alternate Unique Database Keys: A Sample Schema

I plan to approach the comments on this post with a discussion around a specific example design to explain the kinds of things that can go wrong with assigning a Primary Key that isn't a surrogate key. Many of these assumptions are taken from observations from real applications. They are remembered well because of the complexity their design introduced into other business processes as other systems and data sources became dependent on their assumptions.

The design and sample data is below, borrowed loosely from Oracle's infamous Scott/TIGER database design.

SQL Fiddle

MySQL 5.5.32 Schema Setup:

CREATE TABLE employee 
    (
     fake_ssn  varchar(15) primary key,
     last_name varchar(40),
     first_name varchar(40),
     dept_id  varchar(15),
     hire_date  date,
     salary int,
     email varchar(100)
    );

INSERT INTO employee
(fake_ssn, last_name, first_name, dept_id, hire_date, salary,
 email)
VALUES
('130-60-0101', 'MARLOWE', 'JACOB', '1200-05', date('2009/01/25'),
 8000, '[email protected]'),
('967-22-5025', 'CRACHITT', 'BOB', '1200-05', date('2010/02/05'),
 500, '[email protected]'),
('040-36-5555', 'PERRY', 'VICTORIA', '1200-02', date('2011/05/25'),
 2700, '[email protected]'),
('203-89-1010', 'STEVENS', 'KEVIN', '2955-03', date('2007/04/25'),
 1800, '[email protected]'),
('409-99-1111', 'MCLANE', 'JONATHAN', '2955-03', date('2009/03/02'),
 4200, '[email protected]');

CREATE TABLE department
    (
     dept_id  varchar(15) primary key,
     dept_manager varchar(40),
     dept_title varchar(40)
    );

INSERT INTO department
(dept_id, dept_manager, dept_title)
VALUES
('1200-05', 'MARLOWE', 'FINANCE'),
('1200-02', null, 'HR'),
('2955-03', 'JOHNM', 'MARKETING');

COMMIT;

Query 1:

SELECT fake_ssn, last_name, first_name, dept_id, hire_date,
   salary, email    
FROM employee

Results:

|    FAKE_SSN | LAST_NAME | FIRST_NAME | DEPT_ID |                       HIRE_DATE | SALARY |                          EMAIL |
|-------------|-----------|------------|---------|---------------------------------|--------|--------------------------------|
| 040-36-5555 |     PERRY |   VICTORIA | 1200-02 |      May, 25 2011 00:00:00+0000 |   2700 |        [email protected] |
| 130-60-0101 |   MARLOWE |      JACOB | 1200-05 |  January, 25 2009 00:00:00+0000 |   8000 |         [email protected] |
| 203-89-1010 |   STEVENS |      KEVIN | 2955-03 |    April, 25 2007 00:00:00+0000 |   1800 | [email protected] |
| 409-99-1111 |    MCLANE |   JONATHAN | 2955-03 |    March, 02 2009 00:00:00+0000 |   4200 |  [email protected] |
| 967-22-5025 |  CRACHITT |        BOB | 1200-05 | February, 05 2010 00:00:00+0000 |    500 |          [email protected] |

Query 2:

SELECT dept_id, dept_manager, dept_title    
FROM department

Results:

| DEPT_ID | DEPT_MANAGER | DEPT_TITLE |
|---------|--------------|------------|
| 1200-02 |       (null) |         HR |
| 1200-05 |      MARLOWE |    FINANCE |
| 2955-03 |        JOHNM |  MARKETING |

Fake Social Security Number

(The FAKE name is just a reminder that these are all randomly generated values.) Although this is often a popular "unqiue" value in personnel records and databases, according to the U.S. Social Security Administration, this value is not unique. It is also problematic because this value and its transfer is subject to strict regulation because of recently passed privacy laws.

Name Combinations

Even with the additional combinations created by including the middle initial, somehow there are still just too many people with the same name. A look at what the Social Security Administration has to say about the registered names of babies born in 2012:

SS Administration Popular Baby Names

Two decades from now, when the JACOB's and SOPHIA's of 2012 graduate from school, they will flood into the workforce beside thousands of others like them...

Name changes by marriage or legal reasons are also threats to the referential integrity of database records relying on their values as business key values.

Department ID

Some companies will try to derive keys from other values to make SMART KEYS. These types of keys when observed in practice are NOT smart at all. The values in the example: 1200-02, 1200-05, 2955-03 were intended to resemble a "smart key". The first value might be a street address or building number for a company campus or multiple-location business. The second value, ("-02", "-03", "-05") might be the floor of the building where the department is located.

Changing buildings, moving a department or relocating a business completely would render this location dependency of the DEPARTMENT ID useless.

Department Manager

This one is subtle, but there is a hole in this relational connection. A MANAGER is ALSO an employee, which makes the relational join between EMPLOYEE and DEPARTMENT a circular one:

  • Is the MANAGER (from DEPARTMENT) a foreign key constraint on the EMPLOYEE table, or
  • Is DEPT_ID (FROM EMPLOYEE) a foreign key constraint on the DEPARTMENT table?

If you give up the Foreign Key constraint between MANAGER and some key column on EMPLOYEE (LAST_NAME or FIRST_NAME + LAST_NAME), you risk non uniform values for MANAGER.

... Looking at

Query of The DEPARTMENT Table:

| DEPT_ID | DEPT_MANAGER | DEPT_TITLE |
|---------|--------------|------------|
| 1200-02 |       (null) |         HR |
| 1200-05 |      MARLOWE |    FINANCE |
| 2955-03 |        JOHNM |  MARKETING |

The misplacement of DEPT_MANAGER within the DEPARTMENT table, as there are three different ways the name of the Department Manager are represented: none (null), ALL-CAPS Last Name, ALL-CAPS First Name, Last Initial.

Conclusions

An important lesson to take from this posting is that making a key more than a key either by integrating derived values, making values based on assumptions about business rules tie down the flexibility of a database design because if the business rules change, so will the values such as Primary Keys or Joining Key values.

As the developer or maintainer of the business application, you (or your team) are better able to support the prevailing business conditions if you have taken control and own the parts that represent the internal structure of the business application itself. The primary key might never actually present itself in a customer or user-facing situation, but it should be protected so that the relations it represents does not change with time.

Special Thanks:

Image Credit from the popular Baby Names page of 2012:

http://www.ssa.gov/OACT/babynames/#ht=0

Mosa answered 20/4, 2014 at 16:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.