Math question regarding Python's uuid4
Asked Answered
T

3

6

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:

import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])

Is double_str string squared as unique as unique_str or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday problem situation, etc)? This may sound ignorant, but I simply would not know as my math spans algebra 2 at best.

Tarnetgaronne answered 29/11, 2010 at 17:39 Comment(2)
Is uniqueness a continuum? "More unique" is always puzzling, even when you can understand what is meant.Extravasate
@Fred - uuid4 is not guaranteed to not produce collisions. If it were, I wouldn't worry about doing this.Tarnetgaronne
T
17

The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.

If for some reason uuid4 does produce a duplicate it is far more likely to be a programming error such as a failure to correctly initialize the random number generator than genuine bad luck. In which case the approach you are using it will not make it any better - an incorrectly initialized random number generator can still produce duplicates even with your approach.

If you use the default implementation random.seed(None) you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an a issue you would have to solve first. Also, if the OS doesn't provide a source of randomness the system time will be used which is not very random at all.

But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids you need to generate before the probability of generating a duplicate exceeds some probability p. An approcimate formula for this is:

alt text

where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d while keeping p constant then you also square n.

Transcendent answered 29/11, 2010 at 17:45 Comment(8)
Right, improper initialization of the random number generator is the thing to worry about, not the vanishingly small chance of a collision.Bloomington
I don't understand. Are you guys referring to Python incorrectly initializing the random number generator? uuid4 doesn't require me to provide a random number generator. Am I missing something?Tarnetgaronne
@orokusaki: If you don't initialize the random number generator, the default behaviour will be used and this can mean that the system time could be used as the seed. The system time is not very random at all - two different computers will generate the same "random" numbers if the time happens to be the the same. See: docs.python.org/library/random.html#random.seedTranscendent
@Mark - Ok, so I just run random.seed('some long string') before using uuid4()? Do I have to use a random number in the seed too (ie, change the seed on every use), or should I just provide a really big hash, and use the same thing every time? The docs are not very clear.Tarnetgaronne
@orokusaki: No you shouldn't do that - that will definitely give duplicates. You need to use some true source of randomness.Transcendent
@orokusaki: If you don't know what to do then the default random seed should be good enough for you - most of the time you will get a decent seed, especially if the OS can provide a source of randomness that Python can use. The point of my answer is that the uuid4 is already good enough for most purposes, and if there is a problem it is more likely to come from the lack of randomness rather than accidental random collisions.Transcendent
@Mark - thanks. I don't really understand the control flow of using random.seed() in my code. Do I run random.seed(foo_random_number) right before running uuid.uuid4() each time?Tarnetgaronne
@orokusaki: No, you should initialize the random number generator once at the beginning of your program. Calling random.seed() more than once risks creating duplicates.Transcendent
B
1

Since uuid4 is based off a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).

See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?

Bloomington answered 29/11, 2010 at 17:45 Comment(0)
S
-1

It depends on the random number generator, but it's almost squared uniqueness.

Spank answered 29/11, 2010 at 17:44 Comment(2)
Don't you reduce the period of the PRNG by using up another UUID when you didn't have to? Your uniqueness isn't squared tehn if your reducing the time it takes to get duplicate random nubmers.Scruff
@bot403, yes you reduce the period, but it depends on the amount of entropy in the PRNG as to whether that matters or not.Spank

© 2022 - 2024 — McMap. All rights reserved.