ULID datatype in SQL Server database
Asked Answered
B

1

7

While searching for UUIDs I stumbled upon ULIDs https://github.com/ulid/spec

And I want to use them in my SQL Server database as primary key for my users (instead of basic incremental value).

So I wanted to ask:

  1. What datatype is the best suited? uniqueidentifier? binary16? something else? and why?

  2. Are there some other extra steps to fully benefit from ULID structure? especially sorting/searching? in comparison to other UUIDs.

Billposter answered 15/7, 2022 at 20:17 Comment(6)
Do you have any reason at all to not just use an identity specification, which I guess is what you meant by mentioning "basic incremental value"? If yes, tell us these reasons so we can help you. If no, go for a bigint identity.Heptateuch
The reason not to use identity is quite simple. Security. I want accountId be an unique specified which client will have in JWT (between API and client app). And from identity you can 1) find out how many users were already registered by creating new account and checking that value 2) do iteration attacks (depends on implementation), so thats why I want to use some kind of UUID instead of int/bigint identity and through those UUID docs I came to ULID. I use identity on other columns that are not exposed to client side.Billposter
Fair reasoning. As food for thought, I will argue that a bigint starting at 1 billion, or some other large-yet-arbitrary number will provide false, thus dubious, information about total rows.Heptateuch
Thought about that as well, from this approach you are still not protected against iteration attacks/iteration bruteforcing and also you for example are able to see daily/weekly increment in users. By so many things you have to take in consideration I ended up just with UUIDs and the ULID seems to be the most promising one. But I havent found much about its implementation online.Billposter
There is one mistake in your reasoning: Getting the largest value of an identity column does not tell you how many users have registered. Identity columns are not guaranteed to be sequential, and in fact will nearly always have gaps. Moreover, you don't have to seed the identity at 1. The "first" identity value can be any integer you like. But yes, security-by-obscurity may be of some benefit to guard against iteration.Pozzy
for ULID in sql server , you should use binary(16). otherwise, if you use uniqueidentifier, then IDataReader will convert value to GUID , not ULIDWireman
E
4

You are after a 16-byte, universally unique, chronologically sortable unique identifier.

SQL Server already has that: uniqueidentifier.

Many people are under the mistaken impression that a uniqueidentifer is a random value, and that its use as a clusting key is bad because it causes I/O to be scattered all over the place, torn pages, and poor temporaral locality.

That is only true for "random" uuids - known as a "type 4" uuids.

Sequential sortable UUIDs

But there are other types of UUIDs:

  • Type 1: stuffs MAC address+datetime into 128 bits
  • Type 3: stuffs an MD5 hash into 128 bits
  • Type 4: stuffs random data into 128 bits
  • Type 5: stuffs an SHA1 hash into 128 bits
  • Type 6: sequential sortable UUIDs

Rather than being built from random data, or an MD5 hash, or an SHA1 hash, we can use a hash based on a timestamp , with a resolution of 100 ns (i.e. 0.0001 ms).

And similarly to how you can default a uniqueidentifier colunm to a "type 4 random uuid":

CustomerUUID uniqueidentifier DEFAULT newid()

SQL Server also supports sequential type-1 UUIDs:

CustomerUUID uniqueidentifier DEFAULT newsequentialid()

And those uuids sort chronologically - as long as they're created on the same machine

NodeID, MAC address, and Multiple Machines

There is the issue that the type-1 UUID contains a 6-byte nodeID, which is the client's MAC address.

00112233-4455-6677-8899-AABBCCDDEEFF

Timestamp-High Timestamp-Mid Timestamp-Low Sequence NodeID
00112233 4455 6677 8899 AABBCCDDEEFF

And SQL Server, for whatever reason, decided to sort UUIDs by the nodeID first.

SQL Server sorts guids first by f, then e, then d, c, b, a, 9, ..., 2, 1, 0:

{00112233-4455-6677-9988-ffeeddccbbaa}

In other words, it sorts by:

  • Node[0]
  • Node[1]
  • Node[2]
  • Node[3]
  • Node[4]
  • Node[5]
  • UInt16 Sequence (little endian)
  • UInt16 TimestampLow (big endian)
  • UInt16 TimestampMid (big endian)
  • UInt32 TimestampHigh (big endian)

Which means:

  • if all the UUIDs are generated by the same machine (whether it be the database server itself, or a web web-server) then the UUIDs will be sequential based on the timestamp
  • but if the UUIDs are generated on different clients, with different nodeIDs (i.e. MAC address) then you'll have them sorted first by node, and then by timestamp

That may or may not be an issue for you. But if it is, the solution is to shuffle around the bytes of the uniqueidentifier/UUID so that:

  • if SQL Server is gonna sort the 6-byte NodeID first
  • then i'll just put the timestamp there
  • and move the nodeID to where the timestamp was

In C# we use the following class to generate an "SQL Server sortable UUID". Other databases may sort UUIDs/uniqueidentifiers differently.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Web;

/// <summary>
/// Summary description for SortableGuid
/// </summary>
public class SortableGuid
{
    //UUID timestamps are 100ns ticks starting from October 15, 1582 (date we switched to Gregorian calendar)
    //Windows FILETIME is 100ns ticks starting from January 1, 1601  (the date of the start of the first 400-year cycle of the Gregorian calendar)
    private static readonly Int64 guidEpochOffset = -5748192000000000; //6653 days --> 159672 hours --> 9580320 minutes --> 574819200 seconds -> 574819200000 ms -> 574819200000000 us --> 5748192000000000 ticks

    private static Int64 _lastTick = DateTime.UtcNow.Ticks + guidEpochOffset;

    /// <summary>
    /// 
    /// </summary>
    /// <param name="location">The destination, whose value is compared with comparand and possibly replaced.</param>
    /// <param name="comparison">The value that is compared to the value at location1.</param>
    /// <param name="newValue">The value that replaces the destination value if the comparison results in equality.</param>
    /// <returns>true if comparand was greater than location, and location was updated to newValue. 
    /// Otherwise false.</returns>
    public static Boolean InterlockedExchangeIfGreaterThan(ref Int64 location, Int64 newValue, Int64 comparand)
    {
        //Thank you Raymond Chen: https://mcmap.net/q/467377/-interlocked-compareexchange-lt-int-gt-using-greaterthan-or-lessthan-instead-of-equality
        Int64 currentValue;
        do
        {
            currentValue = Interlocked.Read(ref location); //a read of a 64-bit location is not atomic on x86. If location was 32-bit you could get away with just "currentValue = location;"
            if (currentValue >= comparand) return false;
        }
        while (System.Threading.Interlocked.CompareExchange(ref location, newValue, currentValue) != currentValue);
        return true;
    }

    /// <summary>
    /// Returns a new sortable guid. These guid's are suitable as clustering keys in SQL Server.
    /// </summary>
    /// <returns></returns>
    public static Guid NewGuid()
    {
        /*
            Blatently borrowed from Entity Framework.
                https://github.com/dotnet/efcore/blob/master/src/EFCore/ValueGeneration/SequentialGuidValueGenerator.cs


            Except with two differences:
                - they start with an initial timetime, generated statically once - and keep increasing that. 
                    That causes the "timestamp" to drift further and further from reality. 
                    We generate a timestamp each time, and only rely on incrementing if we're not greater than the last timestamp. 
                    A CPU is capable of generating ~200k GUIDs per 100ns tick - so it's not like we can ignore the duplciate ticks problem.
                - UUID timestamp ticks start at October 15, 1582 (date of gregorian calendar). 
                    Windows, and DateTime.Ticks returns number of ticks since January 1, 1601 (the date of the first 400 year Gregorian cycle). 
                    We'll offset the timestamp to match the UUID spec.
                - We place the version number type-7: Sortable by SQL Server with a timestamp.

            SQL Server sorts first by the NodeID (i.e. the final 6-bytes):

            16-byte guid                                            Microsoft clsid string representation
            ===========--=====--=====--=====--=================     ======================================
            33 22 11 00  55 44  77 66  99 88  ff ee dd cc bb aa ==> {00112233-4455-6677-9988-ffeeddccbbaa}
            \_______________________/  \___/  \_______________/
            Timestamp and Version  ^    Clk        Node ID

            The goal is to create a sequential uuid (e.g. UuidCreateSequential), but without having to rely on a MAC address. 
            (i.e. Does an Azure web-site even **have** a MAC address? 
            We certainly can't P/Invoke out to UuidCreateSequental when we're not running on Windows)

            So we conceptually move the 8-byte timestamp to it's new location in NodeID+ClockSequence

            And what used to be Timestamp+Version becomes random.

            And, like type-4 Uuids being called Type-4 to help reduce the chances of collisions between types,
            we call this new version type-7 (sortable by SQL Server with a timestamp).
        */

        //Start from a new random (type 4) uuid. 
        Byte[] guidBytes = Guid.NewGuid().ToByteArray();

        //Generate 8-byte timestamp
        Int64 currentTicks = DateTime.UtcNow.Ticks + guidEpochOffset;
        //Make sure that currentTicks is greater than the last tick count (in case they're generating guids faster than 100ns apart)
        Int64 last;
        do
        {
            last = Interlocked.Read(ref _lastTick); //a read of a 64-bit value isn't atomic; so it has to be interlocked.
            if (currentTicks <= last)
                currentTicks = last + 1;
        } while (Interlocked.CompareExchange(ref _lastTick, currentTicks, last) != last);


        Byte[] counterBytes = BitConverter.GetBytes(currentTicks);

        if (!BitConverter.IsLittleEndian)
        {
            Array.Reverse(counterBytes);
        }

        //This Guid started it's life as a Type 4 (Random) guid, but the final 8 bytes were changed to contain a sequential counter.
        //I want to tag this as a different kind of Uuid algorithm. In Delphi we already called this a Type 7 uuid.
        guidBytes[07] = (Byte)(0x70 | (guidBytes[07] & 0x0f));

        //Put the timestamp in place - in the proper SQL Server sorting order.
        guidBytes[08] = counterBytes[1];
        guidBytes[09] = counterBytes[0];
        guidBytes[10] = counterBytes[7];
        guidBytes[11] = counterBytes[6];
        guidBytes[12] = counterBytes[5];
        guidBytes[13] = counterBytes[4];
        guidBytes[14] = counterBytes[3];
        guidBytes[15] = counterBytes[2];

        Guid result = new Guid(guidBytes);
        return result;
    }
}
Ellita answered 29/3 at 16:8 Comment(4)
So, If I am following. IF you want to use a ULID, you can generate it, convert it to a UUID, and then store it in a uniqueidentifier field?Suicide
@Suicide A uniqueidentifier is a UUID. uniqueidentifier is an alias for UUID. We create a uuid/uniqueidentifier, and then perform (essentially) endian conversion.Ellita
@IanBoyd Ok, I understand that UUID = uniqueidentifier. But when creating ULID in my backend and then transforming it to GUID/UUID and storing it as uniqueidentifier I should have all benefits of Sequential sortable UUIDs right ?Billposter
Yes, you end up with UUID/uniqueidentifer that sorts chronologically.Ellita

© 2022 - 2024 — McMap. All rights reserved.