Seeding numpy.random's default_rng and SeedSequence objects for concurrent.futures.ProcessPoolExecutor

I am learning how to seed the NumPy 1.19 pseudo-random number generator for an analysis that uses Python 3.6's concurrent.futures.ProcessPoolExecutor. After reading NumPy's documentation on Random sampling and Parallel Random Number Generation, I wrote the following script to test my understanding.

My Objective: I want to ensure each concurrent process uses the same seed to start the random process.

What I learnt from my results:

  1. (a) Using a global seed, (b) creating a seeded numpy.random.default_rng or numpy.random.SeedSequence before passing it into a concurrent process, and (c) passing a seed as an argument into the concurrent process all give the same results and ensure that each concurrent process starts the random process from the same seed. That is, there is no need to recreate a BitGenerator for each concurrent process.

  2. Using the spawned child seeds of a seeded numpy.random.SeedSequence() object does not ensure that each concurrent process starts the random process from the same seed. Is the job of the spawn() method of the SeedSequence() object to ensure that different parts of the BitGenerator's output are used, so as to avoid repeats?

Question: Are my conclusions correct?

Test Script:

import numpy as np
from numpy.random import default_rng, SeedSequence
import concurrent.futures as cf

def random( loop ):
    rg = default_rng()
    return loop, [rg.random() for x in range(5)] 

def random_global( loop ):
    rg = default_rng(SEED)
    return loop, [rg.random() for x in range(5)] 
    
def random_rg( loop, rg ):
    return loop, [rg.random() for x in range(5)] 
    
def random_wseed( loop, seed ):
    rg = default_rng( seed )
    return loop, [rg.random() for x in range(5)]

def printresults( futures ):
    for future in cf.as_completed( futures ):    
        print( future.result() )
    

SEED = 1234
nworkers = 4
nloops = 4

rg = default_rng(SEED)

ss = SeedSequence(SEED)
child_seeds = ss.spawn(nloops) # Spawn off 4 child SeedSequences to pass to child processes.

futures_noseed = []
futures_global = []
futures_rg = []
futures_wseed = []
futures_seedseq = []
futures_seedseq_childseeds = []
with cf.ProcessPoolExecutor( max_workers=nworkers ) as executor:
    for nl in range(nloops):
        futures_noseed.append( executor.submit( random, nl ) )
        futures_global.append( executor.submit( random_global, nl ) )
        futures_rg.append( executor.submit( random_rg, nl, rg ) )
        futures_wseed.append( executor.submit( random_wseed, nl, SEED ) )
        futures_seedseq.append( executor.submit( random_wseed, nl, ss) )
        futures_seedseq_childseeds.append( executor.submit( random_wseed, nl, child_seeds[nl]) )

print( f'\nNO SEED')
printresults(futures_noseed)

print( f'\nGLOBAL SEED')
printresults(futures_global)

print( f'\nRG PREDEFINED WITH SEED PASS INTO FUNCTION')
printresults(futures_rg)

print(f'\nPASS SEED INTO FUNCTION')
printresults(futures_wseed)

print(f'\nWITH SEEDSEQUENCE')
printresults(futures_seedseq)

print(f'\nWITH SEEDSEQUENCE CHILD SEEDS')
printresults(futures_seedseq_childseeds)

Output:

NO SEED
(0, [0.739015261152181, 0.14451069021561325, 0.350594672768367, 0.20752211613920601, 0.795523682962996])
(2, [0.7984800506892198, 0.8583726299238038, 0.06791593362457293, 0.53430686768646, 0.0961085560717182])
(3, [0.5277372591285804, 0.33460069291263295, 0.8784128027557904, 0.9050110393243033, 0.6994660907632239])
(1, [0.5819290163279096, 0.9126020141058546, 0.17326463037949713, 0.8475223328152056, 0.23048284365911964])

GLOBAL SEED
(3, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(2, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(1, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(0, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])

RG PREDEFINED WITH SEED PASS INTO FUNCTION
(3, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(2, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(1, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(0, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])

PASS SEED INTO FUNCTION
(1, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(0, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(2, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(3, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])

WITH SEEDSEQUENCE
(2, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(3, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(1, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])
(0, [0.9766997666981422, 0.3801957350196178, 0.9232462337639554, 0.2616924238635442, 0.31909705841419755])

WITH SEEDSEQUENCE CHILD SEEDS
(2, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
(3, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
(1, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
(0, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
Towardly answered 17/7, 2020 at 19:11 Comment(7)
Related: Same output in different workers in multiprocessing, github.com/numpy/numpy/issues/9650 - Packthread
"My Objective: I want to ensure each concurrent process uses the same seed to start the random process." That's a little weird. Usually you'd want different seeds for each process. - Thyroiditis
@user2357112supportsMonica In my real case, I have several scenarios, and for each scenario I have a for-loop over the random() function in which the 5 in range(5) varies. Hence, I thought it appropriate to use a different seed (i.e. a child seed) for each scenario, while every loop within a scenario uses the same seed. Is this approach reasonable or flawed? - Towardly
@user2357112supportsMonica I don't think that's weird at all. Using the same seed across processes for simulation-based optimization methods adds stability to the results. - Labors
@dieterw: How so? If the processes are running the same simulation independently, you just get the same results 4 times, which is useless, and if the processes are collaborating on one simulation, or running separate simulations with different initial conditions, you get spurious correlations between things handled by different processes. You'd usually want different seeds per process so you get statistically independent results. - Thyroiditis
When I've seen seed handling discussed in multi-process optimization, the problem has always been making sure the processes have different seeds far away from each other in the RNG's sequence. - Thyroiditis
@user2357112supportsMonica Absolutely valid, but not if your goal is to optimize with respect to a parameter with everything else equal, including the randomness. This makes results comparable. E.g. is the objective higher because of the parameter change or because of the randomness? Using the same seed ensures it's the parameter. - Labors

Conclusion 2 of the question is not entirely correct. By passing the same spawned child seed of a seeded numpy.random.SeedSequence() object to different concurrent processes, the same seed can be used to start the random process in each of them.

The code below shows an application where, within each batch, every concurrent process uses the same seed to start the random process.

import numpy as np
from numpy.random import default_rng, SeedSequence
import concurrent.futures as cf


def random_wseed( batch, loop, seed ):
    rg = default_rng( seed )
    return batch, loop, [rg.random() for x in range(5)]

SEED=1234
ss = SeedSequence(SEED)
nworkers = 4
nloops = 4
nbatch = 5

futures_seedseq_childseeds = []
with cf.ProcessPoolExecutor( max_workers=nworkers ) as executor:
    for batch in range(nbatch):
        child_seed = ss.spawn(1)
        for nl in range(nloops):
            futures_seedseq_childseeds.append( executor.submit( random_wseed, batch, nl, child_seed[0]) )

print(f'\nFor each batch, each concurrent process uses the same seed to start the random process.')
for future in cf.as_completed( futures_seedseq_childseeds ):    
        print( future.result() )

##Code should achieve this result:
##For each batch, each concurrent process uses the same seed to start the random process.
##(0, 0, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
##(0, 1, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
##(0, 2, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
##(0, 3, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
##(1, 0, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
##(1, 1, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
##(1, 2, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
##(1, 3, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
##(2, 0, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
##(2, 1, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
##(2, 2, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
##(2, 3, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
##(3, 0, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
##(3, 1, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
##(3, 2, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
##(3, 3, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
##(4, 0, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
##(4, 1, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
##(4, 2, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
##(4, 3, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])

Result:

For each batch, each concurrent process uses the same seed to start the random process.
(4, 0, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
(2, 1, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
(0, 0, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
(0, 2, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
(3, 3, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
(2, 0, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
(3, 2, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
(1, 3, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
(3, 1, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
(1, 2, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
(4, 3, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
(3, 0, [0.22148724095124595, 0.09787195733339815, 0.17127991416955768, 0.4819142922814075, 0.7368117871750866])
(1, 1, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
(0, 1, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
(4, 2, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
(2, 3, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
(1, 0, [0.7137868247717851, 0.5945483974175882, 0.3889492785448826, 0.32053552182074196, 0.6488990935363684])
(4, 1, [0.96083157225477, 0.5340463204254748, 0.028932799912096963, 0.4711509829841223, 0.20344219135413988])
(2, 2, [0.07734677155697511, 0.8570271790573564, 0.10048845220790636, 0.0478704579870608, 0.30020477671271684])
(0, 3, [0.5293458940996787, 0.2331172694518996, 0.7607005642504421, 0.9940522082501517, 0.6181026121532509])
Towardly answered 29/9, 2021 at 20:56 Comment(0)

I think it is worth adding some notes on the SeedSequences and spawns themselves, without involving the topic of parallel execution libraries. I hope it helps to better understand them as it took some time for me too.

passing seed vs SeedSequence vs BitGenerator

You are right, it doesn't matter if you pass the final numpy.random.Generator object or just the same input, they all end up in an object that generates the same set of numbers.

prng_seed = numpy.random.Generator(numpy.random.MT19937(0))

ss = numpy.random.SeedSequence(0)
prng_ss = numpy.random.Generator(numpy.random.MT19937(ss))

Both prng_seed and prng_ss will have the same state. You can check the state of the BitGenerator by accessing prng_ss.bit_generator.state. The 624 integers that define the state of an MT19937 are in prng_ss.bit_generator.state["state"]["key"]. You can generate these integers from the SeedSequence object ss too: ss.generate_state(624).
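One way to check the "same state" claim is to compare the two state dictionaries field by field. This is only a sketch, assuming a NumPy version that ships the Generator API (1.17 or later); the helper name same_state is made up for illustration.

```python
import numpy as np


def same_state(a, b):
    """Return True if two Generators wrap MT19937 BitGenerators in identical states."""
    sa, sb = a.bit_generator.state, b.bit_generator.state
    return (
        sa["bit_generator"] == sb["bit_generator"]
        and sa["state"]["pos"] == sb["state"]["pos"]
        and np.array_equal(sa["state"]["key"], sb["state"]["key"])
    )


# Seed once with a plain integer and once with an explicit SeedSequence.
prng_seed = np.random.Generator(np.random.MT19937(0))
prng_ss = np.random.Generator(np.random.MT19937(np.random.SeedSequence(0)))

print(same_state(prng_seed, prng_ss))
print(len(prng_ss.bit_generator.state["state"]["key"]))
```

The first line printed should be True, and the key should contain 624 integers.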

The value 0 does not play the role of the seed number as described on Wikipedia (which follows the original scientific paper); instead, SeedSequence implements a hash function that is responsible for generating the initial state. The value 0 can still be called a seed, but keep in mind that different seed values can lead to the same initial state, because the pool of SeedSequence's hash function is finite (by default 128 bits, i.e. 2^128 possible values).
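To see the hashing behaviour in isolation, here is a small sketch that uses only SeedSequence. It relies on the pool_size attribute (counted in 32-bit words) and on generate_state being repeatable for a given SeedSequence.

```python
from numpy.random import SeedSequence

ss = SeedSequence(0)

# Pool size in 32-bit words; 4 words correspond to the 128 bits mentioned above.
print(ss.pool_size)

# generate_state does not consume anything: asking twice gives the same words.
first = ss.generate_state(8)
again = ss.generate_state(8)
print((first == again).all())

# A different entropy value is hashed to different state words.
other = SeedSequence(1).generate_state(8)
print((first == other).all())
```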

When you pass the Generator (and with it the BitGenerator) to a worker process, it is pickled, so each process works on its own copy; generating random numbers in one process therefore does not affect the others. If you pass the seed integers or the SeedSequence objects instead, the processes have to construct the BitGenerator objects themselves. Many users use this technique because they want different initial states, and it is cheaper to pass a single integer than the whole BitGenerator. Moreover, creating N BitGenerators with different initial states in the main process can take more time than creating them in N processes, where each process creates only one. But of course, in your case it is enough to construct a single BitGenerator and copy it, so there is no remarkable speed gain.
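In the process-pool setting of the question, arguments reach the worker by pickling, so the copying can be imitated without starting any processes. The sketch below does a pickle round trip by hand; it is an approximation of what ProcessPoolExecutor does with a submitted argument, not its actual code path.

```python
import pickle

from numpy.random import default_rng

parent = default_rng(1234)

# Each round trip yields an independent Generator frozen at the parent's
# current state, much like an argument received by a worker process.
copy_a = pickle.loads(pickle.dumps(parent))
copy_b = pickle.loads(pickle.dumps(parent))

draws_a = copy_a.random(3)
draws_b = copy_b.random(3)

# Drawing from the copies did not advance the parent, so it repeats them.
draws_parent = parent.random(3)

print(draws_a)
print(draws_b)
print(draws_parent)
```

All three printed arrays should be identical, which matches the "RG PREDEFINED WITH SEED" output in the question.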

spawn

Using the spawn method and the spawn key is little more than providing another seed number for the BitGenerator. The advantage is that whenever you call spawn(n), the SeedSequence automatically increments the extra seed number (the spawn key) that it hands to its children.

ss_spawn = numpy.random.SeedSequence(entropy=7, spawn_key=())
print(ss_spawn.generate_state(1)) # 2083679832
print(ss_spawn) # entropy=7

child_1 = ss_spawn.spawn(1)[0]
print(ss_spawn.generate_state(1)) # 2083679832
print(ss_spawn) # entropy=7, n_children_spawned=1

child_2 = ss_spawn.spawn(1)[0]
print(ss_spawn.generate_state(1)) # 2083679832
print(ss_spawn) # entropy=7, n_children_spawned=2

But the generated children will be different:

print(child_1) # entropy=7, spawn_key=(0,)
print(child_1.generate_state(1)) # 1201125462
print(child_2) # entropy=7, spawn_key=(1,)
print(child_2.generate_state(1)) # 3618983171

These children are not special snowflakes, they can be generated directly:

ss_direct_child_1 = numpy.random.SeedSequence(entropy=7, spawn_key=(0,))
ss_direct_child_2 = numpy.random.SeedSequence(entropy=7, spawn_key=(1,))
print(ss_direct_child_1.generate_state(1)) # 1201125462
print(ss_direct_child_2.generate_state(1)) # 3618983171
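The same comparison can be written as a loop, so that it does not depend on reading the printed numbers by eye. A minimal sketch, using only SeedSequence attributes that appear above (entropy, spawn_key, n_children_spawned):

```python
from numpy.random import SeedSequence

parent = SeedSequence(entropy=7)
spawned = parent.spawn(2)

for index, child in enumerate(spawned):
    # Rebuild the child by hand from the parent's entropy and its position.
    direct = SeedSequence(entropy=7, spawn_key=(index,))
    same = (child.generate_state(4) == direct.generate_state(4)).all()
    print(index, child.spawn_key, same)

# The parent only remembers how many children it has handed out so far.
print(parent.n_children_spawned)
```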

The spawn key is just another source of entropy, as the documentation says. That is, the effect is similar to adding the spawn key to the entropy directly: it also produces a different state. However, the implementation is apparently different, and I don't know how to arrange the spawn key in the entropy to get the same state. Trailing 0s in the entropy are also discarded:

ss_single = numpy.random.SeedSequence(entropy=7)
print(ss_single.generate_state(1)) # 2083679832
print(ss_single) # entropy=7
ss_single_w0 = numpy.random.SeedSequence(entropy=[7,0])
print(ss_single_w0.generate_state(1)) # 2083679832
print(ss_single_w0) # entropy=[7, 0]

Clarification why the same set of random numbers are used

If I am not mistaken, you'd like to run a type of simulation on a system with different initial states, where the system undergoes a stochastic process, and you'd like to do a statistical analysis of the results. Although you could use different random numbers to mimic the stochastic process, you believe that using the same set of random numbers will help you in analyzing the outcome. The differences between the realizations of the simulation are then provided by the different initial states.
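As a toy illustration of that idea (not your actual model), the sketch below runs a hypothetical simulate function twice with the same seed and a different parameter. The function, its drift parameter and the random-walk model are all assumptions made up for this example.

```python
from numpy.random import default_rng


def simulate(drift, seed, n_steps=1000):
    """Hypothetical toy model: final position of a random walk with drift."""
    rng = default_rng(seed)
    steps = rng.normal(loc=drift, scale=1.0, size=n_steps)
    return steps.sum()


SEED = 1234

# Same seed, so both runs consume the same underlying random numbers.
low = simulate(drift=0.00, seed=SEED)
high = simulate(drift=0.01, seed=SEED)

# The difference reflects only the parameter change (about 0.01 * 1000).
print(high - low)
```

With different seeds per run, the difference would also contain the noise of two independent walks, which is much larger than the effect of the drift change.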

Siegler answered 20/10, 2021 at 21:2 Comment(0)
