What are the best practices for using a GUID as a primary key, specifically regarding performance? [closed]
Asked Answered
M

9

431

I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)

I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migration data between databases.

I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).

What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?

Meakem answered 13/8, 2012 at 16:3 Comment(3)
The issue is not supposed. If your PK is clustered then almost every insert has the potential to cause a page split. In modern versions of SQL Server this was "fixed" with NEWSEQUENTIALID(), but this loses the benefit of being able to calculate it beforehand. I strongly recommend you read up on GUIDs elsewhere as this is way too broad a question and will likely solicit a religious battle that will go on for hours...Vilberg
I'd also add that the word server is ambiguous in I want to assign the Guid on the server side (dont want to let SQL to create the GUID).Doralynne
This question has similarities to this "sql-server-guid-sort-algorithm-why" #7811102Anesthesiologist
N
625

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
 MyINT INT IDENTITY(1,1) NOT NULL,
 .... add more columns as needed ...... )

ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)

CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

Natural answered 13/8, 2012 at 16:34 Comment(23)
This is a great answer, one thing I'd mention is that being able to generate the key before insert is frequently useful. Using "newsequentialid()" can help with the clustering, but that requires an additional round-trip to SQL. So another benefit of the "surrogate key" approach is that you can generate new ids, client-side, with fewer index fragmentation concerns.Schizoid
Just curious. Would storing the GUID as a char(32) or char(36) PK solve this? Why / why not?Tgroup
@FredLackey: no - same problem - since the key is totally random, excessive index fragmentation will happen. Just don't do it.Natural
The way I read this is that having both a non clustered uniqueidentifier column and the int identity column, FK's should also be uniqueidentifier? If you do that, when would you actually use the identity column directly, or would you not?Continent
@Natural If my GUID pk is non-clustered and I use this to join my tables (for portability across databases) then the clustered int index is redundant right? Do you foresee any problem with having no unique clustered index and doing as I describe?Pickax
@A_L: yes, a table without a clustering index is a heap - and that's really really bad for many reasons. Don't just toss your clustered index - it's important on so many levels!Natural
Little question, should the GUID now be used on joins, or the int id? My instinct tells me the GUID should be used, but I fail to see a technical problem using the int id...Hoofed
@NicolasBelley: the int is probably a bit more efficient, since it's 4x smaller in sheer size ...Natural
@Natural but in a replication scenario, if the int column is identity, shouldn't we use the GUID since the int column can repeat itself across devices?Hoofed
Great information, but as with most things, the right choice depends upon the needs of your application. I do however feel like the article would be more balanced with discussion of the GuidComb strategy. The "cheap space isn't the point" article also has some great information, but I'm not sure the test scenarios are optimal for application developers. Using Identity over generating the key app side invariably leads to round-trips. I think comparisons around typical DDD object graph persistence scenarios would help give a more accurate picture to help with the decision process.Jelly
This is an old thread, but might I add: don't just use a useless arbitrary INT as the clustering key. Use something useful like an incerementing date that is actually searched on, that has some relation to the data you're storing. You only get one clustering key, and if you choose the right one you'll get good performanceCatlett
@Natural although outdated, I totally agree with Nick.McDemermaid, but maybe I am missing something?! I am very curious about you're opinion an what Nick.McDemermaid has to say!!Berti
@Kipei: the main issues is the I-F you have such a natural value - then yes, you can use it as a primary key. BUT: values like DATETIME for instance are NOT useful for a clustering key, since they have a 3.33ms accuracy only, and thus duplicates can exist. So in such a case, you *still need an INT IDENTITY instead - therefore, I typically use that by default, since frmo my 20+ years of experience, a really usable natural key hardly ever really exists ....Natural
I currently work on EF6 Code first web application (sql server db 2012 0r 2017) and want to apply DDD concepts so I need a unique key in advance before inserting in DB, Many recommends UUID But I'm afraid of the performance issue, Could u help me please to take the right decision. Should I use GUID instead of auto increments keyKelleekelleher
@AnynameDonotcare: I'd still strongly recommend using an auto-increment INT or BIGINT. What makes you think you must know the ID value before saving?? I fail to see how DDD or any other design approach mandates this......Natural
I've learned that I should keep my model in a valid state and when creating a new instance of a class it's recomended to put all required attributes as constructor parameters. So I converted all my IDs to GUID instead of int, Should I use both of them ? one as auto increment to solve the technical issue and the GUID to solve the DDD issue ?Kelleekelleher
@Natural I've seen this method of adding a surrogate int key to cluster on discussed before. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data? I know everyone says a heap is bad and performs worse, but also it seems there are many opinions out there.Cocksure
@DaleBurrell: check out sqlskills.com/blogs/kimberly/… - and any blog post by Kimberly Tripp for that matter - she's the "Queen of indexing" and I'm pretty sure there's performance testing numbers in her blog somewhere to show this very clearlyNatural
Thanks @Natural - I have read most of her stuff and didn't see data to back that up. But will look again. on your link she says "Oh – and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help" - but doesn't expand on that.Cocksure
@dale_k. Sequential clustering index helps during insert because inserts occur at only the end of the linked list of leaf nodes in the storage B-tree, and helps during index scans because the logical ordering of the records matches the physical ordering of the data on disk. Ref. If your scenarios don't involve these operations, you may not benefit much from an ever-increasing clustered index key.Nunes
What purpose does creating second index that's defined as CLUSTERED serve?Shaddock
This split will unfortunately not work if you require a spatial index on the table too. From MS docs - Spatial indexes require the table to have a clustered primary key.Rufescent
If minecraft runs with guids, anything can run with guidsOutwardly
G
87

I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire and forget merge tables without all the worry of ints matching across joined tables. GUIDs joins can be copied without any worry.

This is my setup for using GUIDs:

  1. PK = GUID. GUIDs are indexed similar to strings, so high row tables (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.

  2. PK Guid is NON-Clustered index. Never cluster index a GUID unless it is NewSequentialID. But even then, a server reboot will cause major breaks in ordering.

  3. Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.

  4. Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.

Here is my Email table...

CREATE TABLE [Core].[Email] (
    [EmailID]      UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,        
    [EmailAddress] NVARCHAR (50)    CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,        
    [CreatedDate]  DATETIME         CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,      
    [ClusterID] INT NOT NULL IDENTITY,
    CONSTRAINT [PK_Email] PRIMARY KEY NonCLUSTERED ([EmailID] ASC)
);
GO

CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID])
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] Asc)
Grazia answered 31/3, 2015 at 21:27 Comment(10)
Could you explain the PK_Email constraint? Why you have ... NonClustered(EmailID ASC) instead of ...Nonclustered(ClusterID ASC) ?Shoshana
You bet. Two main things going on with indexes: 1. Clustered on ClusterID - Orders your table on disk (0% fragmentation). 2. NonClustered on EmailID - Indexes the EmailID field to speed up GUID ID lookups. A GUID field lookup behaves string-ish, so a EmailID lookup would be slow without the index.Grazia
@RobertJ.Good I've seen this method discussed before i.e. adding a surrogate int key to cluster on. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data?Cocksure
Hi @DaleBurrell, the clustered index is to prevent table fragmentation. Performance gain happens as the table naturally grows in order on disk, with low fragmentation.Grazia
@RobertJ.Good Is that a web application? What are you using in urls/hrefs? guid or int?Ruin
@Ruin There are security implications, so drop the newsequentialid() and expose a NewId() Guid if no other choice (definitely not the Int.) I'd recommend a claims based and/or token approach, or even brute-force encryption for any identifiers. In short, avoid exposing any Ids, and avoid any value that can be guessed, or worse +1 to find the next record.Grazia
@RobertJ.Good when you mention "In this distributed database world, it is absolutely the best way to merge distributed data." do you mean you eventually merge the records to a master database? Wondering what happens the the clusterID then, how do you handle duplicates once you merge the "source"?Sector
@Sector I would guess that the clusterID would just be an auto-generated identity both on clients and on the server. the actual value is basically meaningless so does not need to match between sides. (he mentions joining on it, which I think should still work as long as you are not joining across the client/server boundary).Metalloid
I don't really understand the clusterId thing if the point is to make like easier for merges. Won't you get conflicts when you merge tables that have this Id field?Disseminule
@NorbertNorbertson The ClusterId column would be ignored when combining/merging/joining/etc data from different database instances: ClusterId is only used for the IX_Email_ClusterID index (which basically defines how the rows are laid-out on-disk) which is an implementation concern and not part of the conceptual application logic, (so the column could be marked HIDDEN) so application logic should use only EmailID to match corresponding rows in different database instances. (Also, IMO, EmailID should use NewId(), not newsequentialid() - for too many reasons to list here).Frances
V
21

I am currently developing an web application with EF Core and here is the pattern I use:

All my classes (tables) have an int PK and FK. I then have an additional column of type Guid (generated by the C# constructor) with a non clustered index on it.

All the joins of tables within EF are managed through the int keys while all the access from outside (controllers) are done with the Guids.

This solution allows to not show the int keys on URLs but keep the model tidy and fast.

Venosity answered 12/5, 2017 at 8:14 Comment(4)
Is there anything you need to do to configure the integer pK as clustered, like data annotations, or is it just automatically configured?Youth
What the name of the property do you use for Guid one?Weigh
If you receive the Guid in the Controller, how do you access it if you don't know the associated int? do you do the sequential search in the Guid column?Raspings
@CesarAlvaradoDiaz the int PK is only used for Joins. Guid is used for accessing particular rows, such as; "Select ColumnsYouWant From AspNetUsers Where GuidId = @val". The point is that the actual int Id is never exposed to the client side, while it is used for expensive operations like table joins. You should be sure to index the GuidId column in the same way you would index a string column. The data type of GuidId must be nvarchar(36) and not Guid, so that you can index it easily. The int Id is hidden for security reasons, which is the whole point of using a second Guid Id column.Thy
D
5

This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).

https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.

Dorina answered 13/8, 2012 at 16:22 Comment(3)
Totally agree!! But that means that if I have a GUID as PK or a Composite PK with GUID and other field is going to be the same right?Meakem
The PK (index) would be made up of the two columns, but unless you have some business specific reason for doing this, it seems unnecessary.Dorina
BTW this question is one of the most polarizing and debated questions out there and therefore extremely difficult to get an answer for that you will feel 100% comfortable with. Either method comes with trade-offs, so good luck :)Dorina
A
4

Well, if your data never reach millions of rows, you are good. If you ask me, i never use GUID as database identity column of any type, including PK even if you force me to design with a shotgun at the head.

Using GUID as primary key is a definitive scaling stopper, and a critical one. I recommend you check database identity and sequence option. Sequence is table independent and may provide a solution for your needs(MS SQL has sequences).

If your tables start reaching some dozens of millions of rows the most, e.g. 50 million you will not be able read/write information at acceptable timings and even standard database index maintenance would turn impossible.

Then you need to use partitioning, and be scalable up to half a billion or even 1-2 billion rows. Adding partitioning on the way is not the easiest thing, all read/write statements must include partition column (full app changes!).

These number of course (50 million and 500 million) are for a light selecting useage. If you need to select information in a complex way and/or have lots of inserts/updates/deletes, those could even be 1-2 millions and 50 millions instead, for a very demanding system. If you also add factors like full recovery model, high availability and no maintenance window, common for modern systems, things become extremely ugly.

Note at this point that 2 billion is int limit that looks bad, but int is 4 times smaller and is a sequential type of data, small size and sequential type are the #1 factor for database scalability. And you can use big int which is just twice smaller but still sequential, sequential is what is really deadly important - even more important than size - when to comes to many millions or few billions of rows.

If GUID is also clustered, things are much worst. Just inserting a new row will be actually stored randomly everywhere in physical position.

Even been just a column, not PK or PK part, just indexing it is trouble. From fragmentation perspective.

Having a guid column is perfectly ok like any varchar column as long as you do not use it as PK part and in general as a key column to join tables. Your database must have its own PK elements, filtering and joining data using them - filtering also by a GUID afterwards is perfectly ok.

Assonance answered 2/4, 2021 at 11:0 Comment(1)
I mostly agree but "It Depends". The thing is that even IDENTITY and Date Columns used for Clustered Indexes have sometimes even worse problems with fragmentation because of the horrible habit of doing an INSERT and then following that with an "ExpAnsive" update on the rows just inserted. Massive fragmentation is guaranteed and instantaneous. People have to design correctly even if they avoid Random GUIDs. Oddly enough, Random GUID Clustered Indexes about the Insert/Update fragmentation problem for months at a time instead of it being instantaneous.Tag
E
2

Having sequential ID's makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.

Equites answered 15/4, 2019 at 17:10 Comment(9)
Can you provide any logic or evidence to back up this claim? I'm struggling to see how a sequential id might compromise security.Colligan
Sure, if you know ID numbers are integer you can guess sequentially records in a DB. So if you query a single item, you can say that the next item is pk + 1. If you have random GUIDS, it will not follow a pattern. It would be nearly impossible to query other records than the one you previously queried (And know the PK).Equites
If a hacker can query your database you're already compromised, I fail to see how sequential id's make the situation worse.Colligan
No. That is not true. I do a lot with pen testing and am well known for catching hackers. Do I always use GUIDs and not int, no. But if I need to protect data I will rely on data techniques to protect as well as programming and there's a LOT of reasons to do this. Take this older example. When you see a URL like www.domain.com/edit/1012 that makes me cringe. The 1012 is the key of the record and can be switched out. With reactive sites this changes a bit but can still be seen. And it's complex to protect records at that point. Know your data and protect what needs to be protected.Equites
If a user can switch out 1012 for another number and see data they shouldn't then there is a very serious security issue, that issue isn't caused by the primary key choice but it is exacerbated by it. I do take your point, thank you for spelling it out.Colligan
I highlight this as a requirement for any app that deals with hypersensitive data like HIPAA or SOX. Relying on only programming as a security restraint is dangerous. It's best to use multiple methods when protecting sensitive data.Equites
You may use a GUID to locate a record at the web page, that is not the PK of the table. Using query parameter in a website should not define how you structure your DB schema. The PK has nothing to do with input and parameters in UI or backend system.Guadalcanal
This is "security by obscurity", which is not an adequate substitute for proper segregation of data security bounds. A correctly written system won't allow the owner of record 1012 to access record 1013 if they don't own it so the sequentiality matters not.Enterotomy
OWASP document on this issueXenomorphic
A
0

If you use GUID as primary key and create clustered index then I suggest use the default of NEWSEQUENTIALID() value for it.

Alvarado answered 13/8, 2012 at 16:47 Comment(2)
why would you do that?Grobe
That would generally defeat the entire purpose (security) for using a GUIDHolytide
K
0

Another reason not to expose an Id in the user interface is that a competitor can see your Id incrementing over a day or other period and so deduce the volume of business you are doing.

Kidron answered 3/3, 2021 at 0:0 Comment(1)
While your answer is technically correct, it is not an answer to the question.Almetaalmighty
P
-3

Most of the times it should not be used as the primary key for a table because it really hit the performance of the database. useful links regarding GUID impact on performance and as a primary key.

  1. https://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
  2. https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/
Palmira answered 18/2, 2019 at 12:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.