How good is Java's UUID.randomUUID?
Asked Answered
B

9

364

I know that randomized UUIDs have a very, very, very low probability for collision in theory, but I am wondering, in practice, how good Java's randomUUID() is in terms of not having collision? Does anybody have any experience to share?

Bignoniaceous answered 25/3, 2010 at 6:51 Comment(9)
In my experience, I have never seen a collision ;-)Douma
The algorithms are specified in RFC1422: ietf.org/rfc/rfc4122.txtTelfore
@skaffman: the RFC says absolutely nothing about the algorithm used to generate the random digits.Hamster
Since this is a more open ended question, I guess I won't mark any answer as the correct answer; instead, I will give one vote to each of the answers that I think is good :)Bignoniaceous
From wikipedia: ...In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.Jackboot
We have experienced 2 collisions to Date. I do not know how we managed to make them :(Latta
So the bottom line is that a collision is possible. It is possible that the first 2 UUID's you generate will be identical. Unlikely, yes.Lowboy
@Jackboot it's "at least one" not "just one"Billet
If you don't need to generate two ids on the same second, it's 100% secure to the current date + time.Hydrophilous
H
194

UUID uses java.security.SecureRandom, which is supposed to be "cryptographically strong". While the actual implementation is not specified and can vary between JVMs (meaning that any concrete statements made are valid only for one specific JVM), it does mandate that the output must pass a statistical random number generator test.

It's always possible for an implementation to contain subtle bugs that ruin all this (see OpenSSH key generation bug) but I don't think there's any concrete reason to worry about Java UUIDs's randomness.

Hamster answered 25/3, 2010 at 10:43 Comment(7)
"It's always possible for an implementation to contain subtle bugs ..." - Or (donning tin-foil hat) ... deliberate subtle flaws. <:-)Manyplies
Cryptographic strength is completely irrelevant for the question of collisions.Hypanthium
@osa: Not producing collisions (more than to be expected from perfect randomness) is pretty much the lowest quality requirement for a RNG, while cryptographic strength is the highest. In other words, a cryptographically strong RNG will most definitely not produce more collisions than expected.Hamster
It might be useful to note, though, that if you e.g. run a JVM churning out UUIDs inside blogs.vmware.com/cto/…, you will probably get many, many collisions. All software RNGs are PRNGs, and they're ultimately only as good as their source of entropy; two PRNGs that get seeded identically will also behave identically, and that can happen surprisingly often with consistent, exact-duplicate server setups and startup procedures.Blavatsky
@user508633: I would actually expect to get a 100% collision rate in that specific case, but it's a very specific case indeed that goes far beyond "consistent, exact-duplicate server setups and startup procedures". I'm pretty sure you would not get any increased collision rates if you merely cloned a VM and ran it normally. The self-seeding of SecureRandom tries pretty hard to get some real entropy, to the point of blocking execution if it can't find any: seancassidy.me/wiggle-the-mouse-to-fix-the-test.htmlHamster
True, though in some sense this exchanges one egg for another (we've offloaded the problem from the JVM to the Linux kernel). This is fine, as long as you trust the entropy pool estimation in the Linux kernel to be accurate (natural collisions would still be vanishingly likely, but a bug there could erode on security vs. a dedicated adversary).Blavatsky
Note that the SecureRandom CSPRNG is by default seeded by a truly random source on mac/linux/windows. The random source is /dev/random on mac/linux or CryptGenRandom on windows. These truly random sources use things like keyboard/mouse/hardware event timing low-bits. The truly random bits are a limited resource so a CSPRNG is seeded from it which can then generate unlimited random data.Batfish
C
129

Wikipedia has a very good answer http://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:

...

This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes.

...

Thus, for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.

Cambogia answered 30/8, 2010 at 13:11 Comment(10)
I'd also quote from that page, "The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs."Mastaba
This is only true for true randomness, not for pseudorandom numbers like javas UUIDs.Circumvolution
@Markus: completely wrong. Probability of collisions for good pseudorandom RNGs' especially cryptographically strong ones, is no different than "true" randomness.Hamster
Wikipedia is wrong, in reality v4 ID's collide much more than this. Write a program and prove it to yourself.Lowboy
@Eric - I think the onus is on you to back up your assertion. FWIW, the only scenarios I can think of where type 4 UUIDs would collide more frequently that the probability theory says they should are: 1) a bad source of crypto random numbers, or 2) a UUID library that has been compromised.Manyplies
This does not answer the question asked. The question is about the quality of the randomness in Java's UUID.randomUUID(), not about the theoretical chances for a given perfect random number generator.Scintillometer
@Scintillometer - It is not possible to make definitive fact-based statements about the quality of randomness of the Java random number generators. See other answers and comments to understand the reasons why. In that light, the Wikipedia answer is a good one ... and as good as it possible.Manyplies
@StephenC, technically, that is incorrect. You can use statistics with a 99% confidence level, then you expect to see zero collisions in many years. Thus, even a single collision would be "more frequently than the probability theory says", yet there can be no compromised library/random number generator responsible for that.Hypanthium
Practically, how are you going to generate random numbers? Using current time and the network card number? Then two people with the same card numbers (manufacturer error) running the program at the same millisecond will get the same time, no matter what your super-smart-cryptographic algorithm is. Really, this issue has nothing to do with cryptographically strong functions.Hypanthium
@osa - That is true. However, it doesn't really help. Statistical testing is not going to tell you if your generator's implementation has been "tweaked". And the tweak could be such as to (deliberately) cause collisions ... under certain conditions that are unknown to us. If we posit that "bad hats" may have interfered with something, then statistics don't tell us anything.Manyplies
M
74

Does anybody have any experience to share?

There are 2^122 possible values for a type-4 UUID. (The spec says that you lose 2 bits for the type, and a further 4 bits for a version number.)

Assuming that you were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in your lifetime would be vanishingly small. And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated1!

The chances that anyone has experienced (i.e. actually noticed) a duplicate in real life are even smaller than vanishingly small ... because of the practical difficulty of looking for collisions.

Now of course, you will typically be using a pseudo-random number generator, not a source of truly random numbers. But I think we can be confident that if you are using a creditable provider for your cryptographic strength random numbers, then it will be cryptographic strength, and the probability of repeats will be the same as for an ideal (non-biased) random number generator.

However, if you were to use a JVM with a "broken" crypto- random number generator, all bets are off. (And that might include some of the workarounds for "shortage of entropy" problems on some systems. Or the possibility that someone has tinkered with your JRE, either on your system or upstream.)


1 - Assuming that you used "some kind of binary btree" as proposed by an anonymous commenter, each UUID is going to need O(NlogN) bits of RAM memory to represent N distinct UUIDs assuming low density and random distribution of the bits. Now multiply that by 1,000,000 and the number of seconds that you are going to run the experiment for. I don't think that is practical for the length of time needed to test for collisions of a high quality RNG. Not even with (hypothetical) clever representations.

Manyplies answered 25/3, 2010 at 10:33 Comment(3)
"(And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated!)" - that part is relatively straightforward assuming you have stored your uuids in some kind of binary tree structure, it would just be one tree descent per new uuid. You wouldn't need to actually compare it individually against all the previously generated uuids.Gerber
"the chances of a duplicate occurring in your lifetime would be vanishingly small", what will future generations condemn us for...Hydrophilous
Maybe they would condemn us for the assumption that generating 10^6 unique ids per second is unrealistic. Or 10^12. Or that maybe not realizing there is no such thing as a random number. Or something else that is hard imagine. But ... hey ... people have been passing judgement on their ancestors since forever, and it hasn't changed anything.Manyplies
C
21

I'm not an expert, but I'd assume that enough smart people looked at Java's random number generator over the years. Hence, I'd also assume that random UUIDs are good. So you should really have the theoretical collision probability (which is about 1 : 3 × 10^38 for all possible UUIDs. Does anybody know how this changes for random UUIDs only? Is it 1/(16*4) of the above?)

From my practical experience, I've never seen any collisions so far. I'll probably have grown an astonishingly long beard the day I get my first one ;)

Collectivism answered 25/3, 2010 at 8:22 Comment(2)
From wikipedia: ...In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.Jackboot
Actually wikipedia says it's for the next 85 years...I say don't count on it, someone somewhere has generated the same UUID as youYardage
O
13

At a former employer we had a unique column that contained a random uuid. We got a collision the first week after it was deployed. Sure, the odds are low but they aren't zero. That is why Log4j 2 contains UuidUtil.getTimeBasedUuid. It will generate a UUID that is unique for 8,925 years so long as you don't generate more than 10,000 UUIDs/millisecond on a single server.

Ofeliaofella answered 9/3, 2017 at 17:50 Comment(7)
Yes. But the question is asking about random (i.e. type-4) UUIDs.Manyplies
It is asking about the likelihood of getting a collision. The implication is that he wants to be sure to avoid them.Ofeliaofella
(The collision was most likely due to a broken source of randomness for the seeding of the PRNGs. Thought I guess that it was possible that it was due to pure chance.)Manyplies
logging.apache.org/log4j/log4j-2.2/log4j-core/apidocs/org/… -> the idea is to use the time in resolution of 100ns in the uuid as well as the MAC-adress. So if you generate two UUIDs within the same 100ns-interval, they are not guaranteed to be random, they are in fact more likely to be equal than UUIDs with more random bits. I do not know how likely a collision is, but the implication that no collisions can possibly occur as long as the 10,000UUIDs/ms limit is obeyed is wrong if my understanding is correct.Hunkydory
@Hunkydory you must not looked at the code. There is also a counter that rolls over every 10,000 uuids. That is why it is guaranteed to be unique so long as there aren’t more than 10,000 per ms.Ofeliaofella
@Ofeliaofella you are right, there is both a counter that gets incremented on each uuid-generation and a counter for the number of time-slices that have passed, so my concerns are not valid. logging.apache.org/log4j/log4j-2.4/log4j-core/apidocs/src-html/…Hunkydory
If I delete my wrong comment the useful comment of rgoers would lose context, if I do not delete it someone might only read the highlighted part and get the wrong idea. Wish I could edit it.Hunkydory
S
12

Many of the answers discuss how many UUIDs would have to be generated to reach a 50% chance of a collision. But a 50%, 25%, or even 1% chance of collision is worthless for an application where collision must be (virtually) impossible.

Do programmers routinely dismiss as "impossible" other events that can and do occur?

When we write data to a disk or memory and read it back again, we take for granted that the data are correct. We rely on the device's error correction to detect any corruption. But the chance of undetected errors is actually around 2-50.

Wouldn't it make sense to apply a similar standard to random UUIDs? If you do, you will find that an "impossible" collision is possible in a collection of around 100 billion random UUIDs (236.5).

This is an astronomical number, but applications like itemized billing in a national healthcare system, or logging high frequency sensor data on a large array of devices could definitely bump into these limits. If you are writing the next Hitchhiker's Guide to the Galaxy, don't try to assign UUIDs to each article!

Soso answered 24/9, 2017 at 21:4 Comment(1)
As a point of comparison, the chance of winning a Powerball jackpot is 1 in 300 million, but sales of 10 to 20 million tickets are typical. The point is that many people define "impossible" as something less than one chance in hundreds of millions.Soso
B
11

The original generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. By representing a single point in space (the computer) and time (the number of intervals), the chance of a collision in values is effectively nil.

Boutin answered 25/9, 2015 at 11:53 Comment(4)
This explanation makes me optimistic not to see collisions in practice. Can you point to any reference for this statement (some source code would be even better)?Porker
Found this in specs ietf.org/rfc/rfc4122.txt. Nevertheless would be great to see implementation.Porker
That scheme is not what Java implements, however. Java implements type 4 UUID, which is pure random and does not include MAC address or time. Incidentally, since there are now many physical and virtual devices where you can choose your MAC address, the original algorithm does not guarantee uniqueness.Laux
This schema is not what java implements in the standard lib, but it is implemented in java by log4j2: logging.apache.org/log4j/log4j-2.2/log4j-core/apidocs/org/…Hunkydory
F
4

I play at lottery last year, and I've never won .... but it seems that there lottery has winners ...

doc : https://www.rfc-editor.org/rfc/rfc4122

Type 1 : not implemented. collision are possible if the uuid is generated at the same moment. impl can be artificially a-synchronize in order to bypass this problem.

Type 2 : never see a implementation.

Type 3 : md5 hash : collision possible (128 bits-2 technical bytes)

Type 4 : random : collision possible (as lottery). note that the jdk6 impl dont use a "true" secure random because the PRNG algorithm is not choose by developer and you can force system to use a "poor" PRNG algo. So your UUID is predictable.

Type 5 : sha1 hash : not implemented : collision possible (160 bit-2 technical bytes)

Friedrich answered 23/4, 2015 at 6:36 Comment(1)
The probability of winning the lottery is maybe one in 10 or 100 million (10^7 or 10^8) or something like that. The probability of a collision with an 128 bit random number is 3.4 * 10^28. Give me a lottery ticket any time!Manyplies
K
1

We have been using the Java's random UUID in our application for more than one year and that to very extensively. But we never come across of having collision.

Kink answered 11/11, 2014 at 21:3 Comment(1)
Could you be more specific about how many UUIDs have been generated?Matthews

© 2022 - 2024 — McMap. All rights reserved.