I have developed a performance-critical and CPU-intensive application in Python. To reach acceptable performance I'm using PyPy and multiple processes spawned with os.fork. Overall I'm now satisfied with the performance, which is near that of compiled languages. But I have to generate a lot of GUIDs. Initially I used:
import uuid
print(str(uuid.uuid4()))
to generate UUIDs. But profiling revealed that UUID generation accounts for ~20% of the total runtime. This seems to be due to the call to os.urandom() in uuid.uuid4(), which is slow on some systems (including OS X). I then replaced os.urandom() with random.getrandbits():
import random
from uuid import UUID
str(UUID(int=random.getrandbits(128), version=4))
This is fast. But now it turns out my parallel worker processes generate massive numbers of duplicate GUIDs.
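To quantify the speed difference, a rough micro-benchmark with timeit (the iteration count is arbitrary, and timings will of course vary between systems and between CPython and PyPy):

import timeit

# Stdlib uuid4, which reads entropy via os.urandom().
urandom_based = timeit.timeit(
    "str(uuid.uuid4())",
    setup="import uuid",
    number=100_000,
)
# PRNG-based variant using random.getrandbits().
prng_based = timeit.timeit(
    "str(UUID(int=random.getrandbits(128), version=4))",
    setup="import random\nfrom uuid import UUID",
    number=100_000,
)
print(f"os.urandom based:  {urandom_based:.2f}s")
print(f"getrandbits based: {prng_based:.2f}s")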
I suspect Python's pseudo-random number generator is seeded with the current time, which is why different processes generate the same random numbers. One possible fix I can think of is to include the process id in the seed of the number generator (a sketch of this idea follows below). But since I'm not exactly sure how the uuid and random packages are built internally, I'm not sure this will be enough to prevent collisions. I have found some alternative UUID implementations written as C extensions, but because of PyPy I can't use them.
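A minimal sketch of that idea, re-seeding the module-level PRNG in each child right after the fork; whether the pid plus a timestamp carries enough entropy to rule out collisions is exactly what I'm unsure about:

import os
import random
import time

pid = os.fork()
if pid == 0:
    # Child: mix the process id and current time into the seed so
    # siblings no longer start from an identical PRNG state.
    random.seed(hash((os.getpid(), time.time())))
    # ... generate GUIDs here ...
    os._exit(0)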
I'm using Python for this project only because of certain libraries, and I have very little experience coding in Python.
Update:
I have now added seed(SystemRandom().getrandbits(128)) after forking, thus seeding the PRNG from /dev/urandom. This seems to work well so far.
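In context, the fix looks roughly like this (a sketch; the worker body is a placeholder):

import os
import random
from random import SystemRandom

pid = os.fork()
if pid == 0:
    # Child: draw 128 fresh bits from the OS CSPRNG (/dev/urandom)
    # once, and use them to seed the fast Mersenne Twister PRNG.
    random.seed(SystemRandom().getrandbits(128))
    # ... worker loop generating UUIDs via random.getrandbits(128) ...
    os._exit(0)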
I like the idea of seeding the child processes from the main process's RNG, as proposed by rossum. But now that I think of it, using the OS's RNG to seed the PRNG should be even more secure, especially because my application also runs distributed across multiple nodes. IMHO seeding the initial RNG with a MAC address and timestamp and then using rossum's proposal should work too.
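For completeness, my understanding of rossum's proposal as code; each child's seed is drawn in the parent before the fork, so no two children can start from the same PRNG state (NUM_WORKERS and the loop structure are made up for illustration):

import os
import random
from random import SystemRandom

NUM_WORKERS = 4  # illustrative worker count

# Parent: one dedicated RNG, itself seeded from the OS (or, on
# distributed nodes, e.g. from a MAC address and timestamp).
seed_rng = random.Random(SystemRandom().getrandbits(128))

for _ in range(NUM_WORKERS):
    # Draw the child's seed in the parent, *before* forking.
    child_seed = seed_rng.getrandbits(128)
    if os.fork() == 0:
        random.seed(child_seed)
        # ... child worker generates UUIDs, then exits ...
        os._exit(0)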