Why does using NEWID() use more space than NEWSEQUENTIALID()?
Asked Answered
I

1

6

I'm in the middle of doing a bit of research, and I've come across an anomaly that I can't explain (and I've not been able to find anything on Google). Consider the following SQL:

CREATE TABLE MyGuid (
    ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID()
)
GO
CREATE TABLE MyGuidSeq (
    ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID()
)
GO

DECLARE @i INT, @noRecords INT
SET @noRecords = 1000000

-- MyGuid
SET @i = 1
WHILE (@i <= @noRecords)
    BEGIN
        INSERT INTO MyGuid DEFAULT VALUES
        SET @i = @i + 1
    END

-- MyGuidSeq
SET @i = 1
WHILE (@i <= @noRecords)
    BEGIN
        INSERT INTO MyGuidSeq DEFAULT VALUES
        SET @i = @i + 1
    END

exec sp_spaceused 'MyGuid', true
exec sp_spaceused 'MyGuidSeq', true

Results:

Table name  No. Rows   Reserved Space  Actual space  Index Size  Unused Space
MyGuid      1,000,000  34,760 KB       34,552 KB     160 KB      48 KB
MyGuidSeq   1,000,000  24,968 KB       24,768 KB     176 KB      24 KB

Question

Can anyone explain why the reserved / actual space is considerably smaller when using NEWSEQUENTIALID() over NEWID()?

In Response To Answer

I ran the following test to check on the answer I was given by Luaan below:

CREATE TABLE MyGuid (
    ID UNIQUEIDENTIFIER DEFAULT NEWID()
)
GO
CREATE TABLE MyGuidSeq (
    ID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID()
)
GO

DECLARE @i INT, @noRecords INT
SET @noRecords = 1000000
-- MyGuid
SET @i = 1
WHILE (@i <= @noRecords)
    BEGIN
        INSERT INTO MyGuid DEFAULT VALUES
        SET @i = @i + 1
    END

-- MyGuidSeq
SET @i = 1
WHILE (@i <= @noRecords)
    BEGIN
        INSERT INTO MyGuidSeq DEFAULT VALUES
        SET @i = @i + 1
    END

exec sp_spaceused 'MyGuid', true
exec sp_spaceused 'MyGuidSeq', true

The difference being is that I've removed the primary key (meaning the table is now a heap table). This now results in exactly the same sizes for both tables. This proves that the table is physically organized by its clustered index.

name       rows    reserved  data      index_size  unused
MyGuid     1000000 25992 KB  25976 KB  8 KB        8 KB
MyGuidSeq  1000000 25992 KB  25976 KB  8 KB        8 KB
Incase answered 7/1, 2014 at 14:23 Comment(3)
Did you try reading the documentation for NEWSEQUENTIALID? "You can use NEWSEQUENTIALID() to generate GUIDs to reduce page contention at the leaf level of indexes."Arraignment
Yes - but that refers to indexes. The documentation on sp_spaceused says that "reserved" and "actual" space relate to the table, rather than the indexes.Incase
But the table is stored in a clustered index.Arraignment
A
4

This is related to partitioning. Basically, newId() will create GUIDs in random order, which means that you're inserting into the middle of the table all the time. Sequential IDs, on the other hand, will always append to the end of the table, which is much simpler.

If you want to know more, look at some materials on paging. A good start might be the official MSDN page on MS SQL paging - http://technet.microsoft.com/en-us/library/ms190969(v=sql.105).aspx

You also have to understand that rows are inherently organized by ID in the physical files that store the database data. A file with no spaces between IDs (such as when using identity columns and no deletion) can take less space to store the same amount of data.

I'd expect that a full shrink of the database will significantly reduce the amount of space lost to fragmentation in MyGuid table, while it will do very little to MyGuidSeq size.

If you can use sequential GUIDs, do so - they improve INSERT efficiency a lot, and by extension, indices can also be less fragmented and smaller overall.

You're not showing the "time taken" debug outputs, but I expect that those are significantly different as well (even though this can be very much offset by the memory available to the database - it doesn't need to change the data files immediately; if you want to know more about this, look up something about transaction logs).

Aloe answered 7/1, 2014 at 14:27 Comment(8)
I'm not using this code for anything, my research is for a blog post I'm writing :) I was aware that what you have explained was the case for any relevant indexes, but I didn't realise that tables were organised in a similar fashion. I was under the impression that newly inserted data was put at the end of the file, and the clustered index was a separate entity, which stored a physical representation of the index data.Incase
@Incase I'm sure there's database engines that always insert rows sequentially, but MS SQL isn't one of them - it still has something like a B-tree based on the primary key. This is also the reason why tables without primary keys are so incredibly slow, and it's also the reason why you can only have a single primary key, and also why you can't UPDATE the primary key columns - it's not a separate index, it's how the rows are physically stored, both on disk and in memory.Aloe
The reason you can only have a single primary key is a historic accident - there really should be no distinction between primary and other unique key constraints. If you'd said "clustered index" instead, your point would be well made. PK and clustered index are orthogonal concepts. Similarly, you can update PK columns (and CI columns if they're separate) but you probably shouldn't.Arraignment
@Damien_The_Unbeliever: I wouldn't say it is a historic accident. It has it's reasons, and it makes it much easier to work with general tables (you can rely on the fact that each row can be uniquely identified, which vastly simplifies a lot of things). And ah, you're right, you can update the column value. Either I carry outdated information in my head, or maybe I assumed that it doesn't allow you to do the update because I've often tried it on identity columns...Aloe
@Aloe - for any general table design, multiple candidate keys may be identified. each of those keys ought to be declared (via PK or unique), and any of those keys can be used to uniquely identify a row. Those keys should be treated equally, but due to a historic accident, one of those keys is "more equal" than the others. (Some purists do go so far as to avoid declaring PKs on their tables and only using Unique constraints, so that all the keys are on an equal footing, but I think that's taking theory too far)Arraignment
This makes total sense now I've had a bit of a further look into it. I've just ran the same test on two heap tables, and the sizes are exactly the same.Incase
@Arraignment Well, you are right that this is somewhat arbitrary. Personally, however, I'd be much happier if unique keys other than the primary key would allow NULL values to ignore the uniqueness-constraint, eg. you could have an unique e-mail column, but it would only constraint rows which have the column filled in. Since this isn't how it works, it lends more weight to your opinion. However, don't forget that there needs to be some primary column, to use for reference in other indices etc. Not to mention that PK is a logical label, while clustered index is about data structure.Aloe
Nothing to do with either partitioning or paging. To do with fragmentation and pages though.Tenerife

© 2022 - 2024 — McMap. All rights reserved.