Fast GUIDs in Python
I have developed a performance-critical, CPU-intensive application in Python. To reach acceptable performance I'm using PyPy and multiple processes created with os.fork. Overall I'm now satisfied with performance, which is close to that of compiled languages. But I have to generate a lot of GUIDs. Initially I used:

import uuid
print(str(uuid.uuid4()))

to generate UUIDs. But profiling revealed that UUID generation accounts for ~20% of the total runtime. This seems to be due to the call to os.urandom() inside uuid.uuid4(), which is slow on some systems (including OS X). I then replaced os.urandom() with random.getrandbits():

import random
from uuid import UUID

str(UUID(int=random.getrandbits(128), version=4))

This is fast. But now it turns out that my parallel worker processes generate duplicate GUIDs en masse.

I suspect Python's pseudo-random number generator is seeded with the current time, which is why different processes generate the same random numbers. One possible fix I can think of is to include the process ID in the seed of the number generator. But since I'm not exactly sure how the uuid and random packages are built internally, I'm not sure this will be enough to prevent collisions. I have found some alternative UUID implementations written as C extensions, but because of PyPy I can't use them.

I'm using Python for this project only because of certain libraries, and I have very little experience coding in Python.

Update:

Now I have added random.seed(SystemRandom().getrandbits(128)) after forking, thus seeding the PRNG from /dev/urandom. This seems to work well so far.
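For reference, the reseeding step looks like this in context (a minimal sketch; the fork/worker scaffolding is elided and the exact placement in the worker is an assumption):

```python
import random
import uuid
from random import SystemRandom

# Immediately after os.fork(), in each child process:
# replace the PRNG state copied from the parent with 128 fresh bits
# drawn from the OS CSPRNG (os.urandom under the hood).
random.seed(SystemRandom().getrandbits(128))

# GUID generation then continues on the fast, PRNG-based path:
guid = str(uuid.UUID(int=random.getrandbits(128), version=4))
print(guid)
```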

I like the idea of seeding the child processes from the main process's RNG, as proposed by rossum. But now that I think of it, using the OS's RNG to seed the per-process RNG should be even more secure, especially because my application also runs distributed across multiple nodes. IMHO, seeding the initial RNG with a MAC address and timestamp and then following rossum's proposal should work too.

Admirable answered 12/8, 2018 at 17:29 Comment(1)
Two ideas. First, use a single master RNG and use it to generate a new seed for each process's sub-RNG as required. Alternatively, generate short UUIDs and insert a few guaranteed-unique bytes at specific allowed positions to bring them up to the correct size. – Spritsail
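The first idea can be sketched like this (a minimal illustration; the worker count and master seed are placeholders):

```python
import random
import uuid

# One master RNG in the parent; in practice you would seed it once
# from os.urandom rather than with a constant.
master = random.Random(12345)

def make_worker_rng():
    # Hand each worker its own Random instance seeded from the master,
    # so no two workers start from identical PRNG state.
    return random.Random(master.getrandbits(128))

workers = [make_worker_rng() for _ in range(4)]
guids = [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for rng in workers]
print(guids)
```

With 128 bits of state per seed, collisions between worker streams are astronomically unlikely, while each GUID still costs only a cheap PRNG call.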
E
2

This happens because forking gives each process a copy of the same memory, including a copy of the same PRNG state. Reseeding after the fork with SystemRandom solves the problem because SystemRandom draws its randomness from the operating system, which is global, rather than from per-process PRNG state.
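The effect can be demonstrated without an actual fork by copying PRNG state with getstate()/setstate(), which is exactly what a fork does implicitly:

```python
import random

# The parent's PRNG state at fork time...
parent = random.Random(42)
state = parent.getstate()

# ...is inherited byte-for-byte by every child.
child_a = random.Random()
child_b = random.Random()
child_a.setstate(state)
child_b.setstate(state)

# Both "children" now emit the identical stream -> duplicated GUIDs.
assert child_a.getrandbits(128) == child_b.getrandbits(128)

# Reseeding each one from the OS CSPRNG makes the streams diverge.
child_a.seed(random.SystemRandom().getrandbits(128))
child_b.seed(random.SystemRandom().getrandbits(128))
assert child_a.getrandbits(128) != child_b.getrandbits(128)
```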

Extracanonical answered 7/5, 2020 at 13:20 Comment(0)

Note that using the random module's default PRNG this way is still potentially unsafe: it is a fairly weak, predictable userspace PRNG. My first suggestion would be to simply replace

import random

with

from random import SystemRandom
random = SystemRandom()
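With that drop-in replacement, the snippet from the question works unchanged (a minimal sketch):

```python
from uuid import UUID
from random import SystemRandom

# SystemRandom exposes the same getrandbits() API but draws entropy
# from the OS (os.urandom) instead of per-process PRNG state, so
# forked workers cannot produce correlated streams.
random = SystemRandom()
guid = str(UUID(int=random.getrandbits(128), version=4))
print(guid)
```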

But if that's still slow, I did make a deterministic userspace CSPRNG you can try, here: https://pypi.org/project/streamrandom/

This even includes UUID generation functionality, which you can use like so:

import os
from streamrandom import StreamRandom, stream_from_seed

random = StreamRandom(stream_from_seed(os.urandom(128).decode("charmap")))
uuid = random.uuid4()
Certainty answered 13/7, 2020 at 9:45 Comment(0)
