When should I use uuid.uuid1() vs. uuid.uuid4() in Python?
M

6

266

I understand the differences between the two from the docs.

uuid1():
Generate a UUID from a host ID, sequence number, and the current time

uuid4():
Generate a random UUID.

So uuid1 uses machine/sequence/time info to generate a UUID. What are the pros and cons of using each?

I know uuid1() can raise privacy concerns, since it's based on machine information. I wonder if there's anything more subtle to consider when choosing one or the other. I just use uuid4() right now, since it's a completely random UUID. But I wonder if I should be using uuid1 to lessen the risk of collisions.

Basically, I'm looking for people's tips for best-practices on using one vs. the other. Thanks!

Michaeline answered 23/11, 2009 at 19:48 Comment(2)
Here is an alternative approach to UUIDs. Though the chance of a collision is infinitesimal, a UUID doesn't guarantee uniqueness. To guarantee uniqueness you may want to use a compound key such as [<system id>,<local id>]. Each system participating in data sharing must have its own unique system ID, either assigned during system set-up or obtained from a common pool of IDs. The local ID is unique within any particular system. This involves more hassle but guarantees uniqueness. Sorry for going off-topic, just trying to help.Osric
Doesn't take care of the "privacy concerns" he mentionedEruct
I
310

uuid1() is guaranteed to not produce any collisions (under the assumption that you do not create too many of them at the same time). I wouldn't use it if it's important that there's no connection between the UUID and the computer, as the MAC address is used to make it unique across computers.

You can create duplicates by creating more than 2^14 (16,384) uuid1s in less than 100ns, but this is not a problem for most use cases.
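To make the parts mentioned above concrete, here is a minimal sketch that prints the fields Python exposes on a version 1 UUID. The name MAX_PER_TICK is made up for this illustration; the attributes themselves come from the standard uuid module.

```python
import uuid

u1 = uuid.uuid1()
u4 = uuid.uuid4()

# A version 1 UUID exposes the parts it was built from.
print("version  :", u1.version)    # version field of the UUID
print("node     :", hex(u1.node))  # 48-bit host ID, usually the MAC address
print("clock_seq:", u1.clock_seq)  # 14-bit sequence number
print("time     :", u1.time)       # 60-bit count of 100 ns intervals

# The sequence number has 14 bits, hence the 2^14 limit per timestamp value.
MAX_PER_TICK = 2 ** 14  # hypothetical name, used only in this sketch
print("distinct clock_seq values:", MAX_PER_TICK)

# A version 4 UUID carries random bits plus the version/variant markers.
print("version  :", u4.version)
```

The node value is the part that ties the UUID to a machine, which is where the privacy concern comes from.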

uuid4() generates, as you said, a random UUID. The chance of a collision is really, really, really small. Small enough that you shouldn't worry about it. The problem is that a bad random-number generator makes collisions more likely.

This excellent answer by Bob Aman sums it up nicely. (I recommend reading the whole answer.)

Frankly, in a single application space without malicious actors, the extinction of all life on earth will occur long before you have a collision, even on a version 4 UUID, even if you're generating quite a few UUIDs per second.

Ironsides answered 23/11, 2009 at 20:5 Comment(7)
Sorry, I commented without researching fully - there are bits reserved to keep a version 4 uuid from colliding with a version 1 uuid. I will remove my original comment. See tools.ietf.org/html/rfc4122Calamus
@gs Yeah, makes sense with what I was reading. uuid1 is "more unique", while uuid4 is more anonymous. So basically use uuid1 unless you have a reason not to. @mark ransom: Awesome answer, didn't come up when I searched for uuid1/uuid4. Straight from the horse's mouth, it seems.Michaeline
uuid1 won't necessarily produce unique UUIDs if you produce several per second on the same node. Example: [uuid.uuid1() for i in range(2)]. Unless of course something strange is going on that I'm missing.Chaplet
@Michael: uuid1 has a sequence number (4th element in your example), so unless you use up all the bits in the counter you don't have any collision.Silent
I should have actually tested. They just happened to look identical. But I have run into a collision before with a snippet similar to the above and a number larger than 2.Chaplet
@Michael: I've tried researching the circumstances when collisions happen and have added the information I found.Silent
If correctly implemented, uuid4 will produce fewer collisions, not more. The risk that the server dies or the whole data center explodes is higher than the risk of a collision in the generated uuid4. uuid4 is, however, probably slower than uuid1.Gillett
C
45

My team just ran into trouble using UUID1 for a database upgrade script where we generated ~120k UUIDs within a couple of minutes. The UUID collision led to violation of a primary key constraint.

We've upgraded 100s of servers but on our Amazon EC2 instances we ran into this issue a few times. I suspect poor clock resolution and switching to UUID4 solved it for us.

Coax answered 21/12, 2015 at 8:51 Comment(0)
F
40

One instance when you may consider uuid1() rather than uuid4() is when UUIDs are produced on separate machines, for example when multiple online transactions are processed on several machines for scaling purposes.

In such a situation, duplicate IDs become more likely with uuid4(), both because the pseudo-random number generators may be poorly initialized and because more UUIDs are being produced overall.

Another benefit of uuid1() in that case is that the machine where each UUID was originally produced is implicitly recorded (in the "node" part of the UUID). This, together with the time information, can help, if only with debugging.
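As a rough sketch of that debugging use, the node and creation time can be read back from a version 1 UUID. The helper name describe_uuid1 is hypothetical; the conversion relies on the UUID timestamp counting 100 ns intervals since 1582-10-15.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID timestamps count 100 ns intervals since 1582-10-15 (the Gregorian epoch).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def describe_uuid1(u):
    """Return (creation time in UTC, node as a hex string) for a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    # Drop the last decimal digit: 100 ns units -> microseconds.
    created = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)
    return created, format(u.node, "012x")

created, node = describe_uuid1(uuid.uuid1())
print(created.isoformat(), node)
```

The recovered time is only as trustworthy as the clock of the machine that generated the UUID.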

Fiume answered 23/11, 2009 at 20:18 Comment(1)
The probability that 126 bits of true random collide is extremely low. Actually so low that I believe it doesn't matter.Gillett
S
11

One thing to note when using uuid1: if you use the default call (without passing the clock_seq parameter), you have a chance of running into collisions, because you have only 14 bits of randomness (generating 18 entries within 100ns gives you roughly a 1% chance of a collision; see the birthday paradox/attack). The problem will never occur in most use cases, but on a virtual machine with poor clock resolution it will bite you.
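The 1% figure can be checked with a short birthday-paradox calculation. This is a sketch under the same assumption the answer makes, namely that each call draws an independent, uniformly random 14-bit clock_seq while the timestamp stays the same; the function name is made up for the example.

```python
def collision_probability(n, bits):
    """Probability that n uniformly random values of `bits` bits are not all distinct."""
    space = 2 ** bits
    p_unique = 1.0
    for i in range(n):
        # The (i+1)-th value must avoid the i values already drawn.
        p_unique *= (space - i) / space
    return 1.0 - p_unique

p = collision_probability(18, 14)
print(f"{p:.4f}")
```

For 18 values in a 14-bit space the result comes out a little under 1%, which matches the rough figure quoted above.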

Sweetandsour answered 17/6, 2014 at 18:49 Comment(3)
@Guilaume it would be really useful to see an example of good practice using clock_seq....Regale
@Guilaume How have you calculated this chance of 1%? 14 bits of randomness means a collision is guaranteed to happen if you generate more than 2^14 ids per 100ns, and this means that a 1% chance of a collision is when you produce roughly 163 ids per 100 ns.Evensong
@Evensong As I said, you should look at the birthday paradox.Sweetandsour
P
7

Perhaps something that's not been mentioned is that of locality.

A MAC address or time-based ordering (UUID1) can afford increased database performance, since it's less work to sort numbers that are closer together than those distributed randomly (UUID4) (see here).

A second, related point is that using UUID1 can be useful in debugging, even if origin data is lost or not explicitly stored (this is obviously in conflict with the privacy issue mentioned by the OP).
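One caveat worth illustrating: in the version 1 layout the low 32 bits of the timestamp come first, so the canonical string order is not necessarily creation order. A minimal sketch, assuming you want chronological order and the generating clock did not step backwards, is to sort on the time field explicitly:

```python
import uuid

ids = [uuid.uuid1() for _ in range(5)]

# Sort on the 60-bit timestamp, using the sequence number as a tie-breaker.
by_time = sorted(ids, key=lambda u: (u.time, u.clock_seq))

for u in by_time:
    print(u.time, u)
```

Whether a database benefits from this depends on how it stores and compares UUID values, so treat the locality gain as something to measure rather than assume.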

Payne answered 13/4, 2017 at 16:46 Comment(0)
L
6

In addition to the accepted answer, there's a third option that can be useful in some cases:

v1 with random MAC ("v1mc")

You can make a hybrid between v1 & v4 by deliberately generating v1 UUIDs with a random multicast MAC address (this is allowed by the v1 spec). The resulting v1 UUID is time dependent (like regular v1), but lacks all host-specific information (like v4). It's also much closer to v4 in its collision resistance: v1mc = 60 bits of time + 61 random bits = 121 unique bits; v4 = 122 random bits.

The first place I encountered this was Postgres' uuid_generate_v1mc() function. I've since used the following Python equivalent:

from os import urandom
from uuid import uuid1
_int_from_bytes = int.from_bytes  # py3 only

def uuid1mc():
    # NOTE: The constant here is required by the UUIDv1 spec...
    return uuid1(_int_from_bytes(urandom(6), "big") | 0x010000000000)

(note: I've got a longer + faster version that creates the UUID object directly; can post if anyone wants)
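A quick way to sanity-check the helper is to confirm that the multicast bit is set and that no fixed node is reused. This sketch repeats the uuid1mc() function from above so it runs on its own; the constant name MULTICAST_BIT is an addition for readability.

```python
from os import urandom
from uuid import uuid1

MULTICAST_BIT = 0x010000000000  # lowest bit of the first octet of the node

def uuid1mc():
    # Same helper as above, repeated so this snippet is self-contained.
    return uuid1(int.from_bytes(urandom(6), "big") | MULTICAST_BIT)

a, b = uuid1mc(), uuid1mc()
print(a)
print(b)
# Version is still 1, the multicast bit is set, and the nodes are random per call.
print(a.version, bool(a.node & MULTICAST_BIT), a.node != b.node)
```

Because a real network card's unicast MAC address has this bit cleared, a node with the bit set cannot be mistaken for a hardware address.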


In case of LARGE volumes of calls/second, this has the potential to exhaust system randomness. You could use the stdlib random module instead (it will probably also be faster). But BE WARNED: it only takes a few hundred UUIDs before an attacker can determine the RNG state, and thus partially predict future UUIDs.

import random
from uuid import uuid1

def uuid1mc_insecure():
    return uuid1(random.getrandbits(48) | 0x010000000000)
Languedoc answered 18/8, 2017 at 18:9 Comment(6)
Seems like this method is "like" v4 (host-agnostic), but worse (less bits, dependence on urandom, etc). Are there any advantages compared to just uuid4?Michaeline
This is primarily just an upgrade for cases where v1 is useful for its time-based qualities, yet stronger collision resistance and host privacy are desired. One example is as a primary key for a database: compared to v4, v1 uuids will have better locality when writing to disk, have a more useful natural sort, etc. But if you've got a case where an attacker predicting 61 bits is a security issue (e.g. using a uuid as a nonce), then $deity yes, use uuid4 instead (I know I do!). Re: being worse because it uses urandom, I'm not sure what you mean; under Python, uuid4() also uses urandom.Languedoc
Good stuff, that makes sense. It's good to see not just what you can do (your code), but also why you'd want it. Re: urandom, I mean that you're consuming 2x the randomness (1 for uuid1, another for the urandom), so could use up system entropy quicker.Michaeline
It's actually about half as much as uuid4: uuid1() uses 14 bits for clock_seq, which rounds up to 2 bytes of urandom. The uuid1mc wrapper uses 48 bits, which should map to 6 bytes of urandom, for a total of urandom(8) consumed per call, whereas uuid4 directly invokes urandom(16) for every call.Languedoc
Apparently a lot of hosting providers (e.g. AWS) use the same MAC address for very large numbers of hosted VMs. Container-based deployments like Kubernetes seem to have similar problems. In that case, it seems like you might want to take this "Random MAC UUID1" approach, generating a random (static?) MAC address for each container/instance, assuming you want the "time locality" property of UUID1 as opposed to completely random UUID4. Does that seem reasonable?Grady
Also, I found the relevant portion of RFC 4122: 4.5. Node IDs that Do Not Identify the HostGrady
