ID Best Practices for Databases [closed]
Asked Answered
C

7

11

I was wondering what the best practices were for building and storing IDs. A few years ago, a professor told me about the dangers of a poorly constructed ID system, using the Social Security Number as an example. In particular, because SSNs do not have any error detection... it is impossible to tell the difference between a 9-digit string and a valid SSN. And now government agencies need things like Last Name + SSN or Birthday + SSN to keep track of your data and ensure its verification. Plus, your Social Security number is somewhat predictable based on where you were born.

Now I'm building a User database... and based off of this advice "userid mediumint auto_increment" would be unacceptable. Especially if I plan to use this ID as the primary identification for the user. (for example, if I allow the users to change their username, then the username would be more difficult to keep track than the numerical userid... requiring cascading foreign keys and whatnot.) Emails change, usernames can change, passwords change... but a userid should remain constant forever.

Clearly, auto_increment is only designed for surrogate_keys. That is, its a useful shortcut only when you already have a primary identification mechanism, but it shouldn't be used as an "innate identifier" for the data. Creating random UUID looks interesting, but the randomness turns me off.

And so I ask: whats the best practices for creating a "primary key" identification number?

Chromoprotein answered 3/12, 2010 at 22:21 Comment(6)
What about your professor's advice led you to conclude that auto-incrementing integers were inappropriate as unique identifiers for user data?Olgaolguin
Auto-incremented integers are predictable and hold no form of error-detection. At very least, I'd expect a "professional grade" ID practice to be somewhat unpredictable and self-identifying. For example, Credit Card numbers have one checksum digit, meaning that if a human types a Credit Card incorrectly, there is only a 1/10 chance that it would be accepted. They're also reasonably unpredictable, so a hacker can't just type in random credit card numbers into Amazon and hope that he even has a valid credit card number. Similarly, a hacker shouldn't blast dictionary attacks at predictable UIDs.Chromoprotein
I don't understand your comparison here. I would be stunned if credit card companies used actual credit card numbers as database IDs, rather than storing them as some heavily-secured attribute in a table. Your comment implies that knowledge of an ID would be some sort of backdoor into the database. Authentication of some sort should be the defense against unauthorized data access, not knowledge of random database values.Olgaolguin
@Chromoprotein - I submit that in this day and age, that check digit on a CC is not nearly as useful as a query against the CC company with the number, name and CCV value to verify its owner. The only way to know for sure if the CC is valid is to query the authoritative source.Biophysics
@Thomas: I'm not implying that CC numbers are validated by their check digit alone. However, it seems to be an advantage with few costs. Check digits can be easily implemented in efficient Javascript, and the user can immediately know that he made a mistake while typing. IE: Check Digits are there for the usability, and not for security.Chromoprotein
@Chromoprotein - I disagree that there are few costs especially revolving around change. There is effort involved in deriving the check digit algorithm and maintaining it in the face of a change to the identifier. Credit cards are highly distributed in that many companies can generate values (akin to the shipping company example I gave in my post). If the system being built will always be the sole authoritative source, then a check digit provides almost no advantage IMO.Biophysics
D
12

You are confusing internal database functionality with external search criteria.

Auto-increment surrogate keys are useful for internal application use. Never pass those on to the user. Identifying business objects, whether it is a user or an invoice, are done with unique information about the object, like SSN, CCN or DOB. Use as much info as necessary to uniquely identify the object.

I highly recommend that if you must supply some newly invented ID value to each customer, that it NOT be the field you link all the customer data tables on.

Dedrick answered 3/12, 2010 at 23:14 Comment(1)
This answer makes the most sense to me. Thanks.Chromoprotein
L
3

The best practice is to use an auto-increment integer. There's no real reason it shouldn't be used as an "innate identifier". It'll provide the most compact usage in foreign keys and fastest searches. Almost any other value can change and is inappropriate for use as a key.

Liven answered 3/12, 2010 at 22:33 Comment(9)
would this value eventually get to large to store with to many users?Wassail
@mike, use a 64-bit int in code and you'll never run out of values when tracking users. 9,223,372,036,854,775,807 possible values, or double that if you use an unsigned int 64.Liven
You're partially right. But we need to keep in mind that if we do not expose the ID to the user, i.e. for searching, we will not take advantage of clustered indexes whatsoever.Ontologism
@kerzek, can you explain that statement about not taking advantage of clustered indexes? most often ids are used in joins so the clustered index will reduce an extra step in gathering data resulting from a join. Besides, you're also assuming a specific implementation with clustered indexes. Not all MySQL storage engines support clustered indexes.Liven
Isn't it a problem when you use this externally that people can easily get information they are not supposed to get (e.g. how fast your user base is growing)?Pegpega
And what do you think about Best Practices in User ID Formation by Witty&Alan?Pegpega
@MartinThoma I think that has zero relevance to this question. Witty & Alan are talking about a user id in a scenario where the ID is used to identify users externally. When we are talking about a user id in a database we're talking about an internal identifier.Liven
@SamuelNeff Ok. So reading this article I can understand why one would like to have an external identifier which has a couple of different properties. And why should one have an internal identifier which is different from the external one? I don't think the difference in size will matter.Pegpega
@MartinThoma I'm not interested in arguing. If you don't see the benefits then that's fine, you can use whichever you prefer. Also keep in mind the article is 15 years old and this answer is 8 years old. Best practices and guidelines do change.Liven
S
2

Might be helpful to review what some other databases do to expose IDs.

Salesforce uses the first three characters to determine the object, and then the next 12 are incremented case-sensitive.

So a Salesforce Account starts 001, and a Salesforce Contact starts 003.

So a Salesforce Account might look like a 15 digit case-sensitive 001000246abcABC. But case-sensitive IDs are a problem for Excel (sorting, deduplicating, etc) so most people use Salesforce's 18 digit IDs which are case insensitive. There's a standard formula to convert them from 15 to 18.

Stripe prefixes their IDs with cus_ for customers or pi_ for payments. So a customer might be cus_abcdABCD123456 (14 digits) but a payment might be pi_0123456789abcdeABCDE1234 (24 digits).

Xero ID's look like this for Contacts, abcd1234-ab12-12ab-9902-abcdef123456.

QuickBooks Online has made the questionable decision to expose its IDs as company-specific incremental integers. So your invoices will be 1, 2, 3, etc. This is also problematic in that EVERY QBO company will have an Invoice ID of 1, making collisions in databases inevitable if you have multiple QBO Companies' data in one place.

Shadrach answered 22/8, 2021 at 17:28 Comment(2)
Interesting insights you've given. Could I get the links to how you got to know how these applications create their IDs?Sleepwalk
I've worked with all of them, and from there gained familiarity. You can usually see their IDs in the URL (or at least the exposed IDs).Shadrach
V
1

Comparing SSNs to auto-incremented integers is apples and oranges. Personally, I avoid GUIDs / UUIDs / UIDs unless there will be so many records in the table that it becomes inefficient or unreasonable to use an integer.

It's very rare that you will find a true natural key. What seems unique today may change tomorrow based on business requirements / laws.

Vastitude answered 3/12, 2010 at 22:39 Comment(0)
O
1

Based on our conversation above in the comments, I'm posting this as an answer. It seems as though you believe that having a random, unique ID assigned to your users would provide them with enough security that you could forgo normal methods of authentication.

At any rate, I'm confused by your comparisons between secured data and auto-incrementing, integer-based ID columns in user tables. These two types of data should never ever be intermingled. Your credit card company shouldn't be using a CCN as a primary key in a database table, and the government shouldn't be using your name or SSN as a primary key in its database tables either.

Why should you (or anyone) authenticate users with only knowledge of some secured data? Corporations are no longer allowed to authenticate users based on their SSNs, and I know my credit card company doesn't identify me based on my CCN (especially since I have more than one, and have had the card numbers on the accounts changed several times).

Even if you implemented a UUID and generated some arbitrary random number, it's still just that: a number. Active Directory authentication uses GUIDs for its IDs, but also requires users to provide usernames and passwords. Using a larger or smaller data type as an ID column doesn't mean I can wash my hands of some other type of authentication or security.

Olgaolguin answered 3/12, 2010 at 23:3 Comment(1)
I was just about to expand my post to this effect. A number, any number, by itself, is never sufficient to determine validity and authenticity to the person to whom it is associated.Biophysics
V
0

This is what sequences where designed to solve. Create a object that can atomically be increased per insert. In some DBs that is auto incremented integer and in others it's a sequence object but the idea is the same, ie create a key that can't conflict and is unique.

Also UUIDs as a ID is fine and I have used it before for special reasons. Why does the randomness "turn you off"? There is virtually no chance of a conflict.

Virgilio answered 3/12, 2010 at 22:41 Comment(0)
B
0

At the end of the day, the way to verify whether a given user's identifier is valid is the system itself. I.e., your system is the authoritative source for those identifiers. Is 555-45-9999 a valid SSN? The only way to know for sure is to have Social Security look it up and match it to the name of the person claiming to have that number. Sure, we can use the SSN identifier scheme to place a preliminary guess as to whether it is valid. However, only a lookup in their system will tell us for sure. The need for check digits would arise in highly distributed systems where, for example, you might want to allow other people to generate numbers honored by your system (e.g. shipping companies that let customers generate their own tracking numbers). Since it is your system that is going to generate the identifiers in an automated fashion, the best a check digit does for you is to help, in a rudimentary fashion, with validation on data entry or searches.

Biophysics answered 3/12, 2010 at 22:48 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.