Are random seeds compatible between systems?

Asked 12/9, 2018 at 11:17 Answered 13/9, 2018 at 10:11

Solved python random scikit-learn pyspark apache-spark-mllib

I built a random forest model using Python's sklearn package, with the seed set to, for example, 1234. To productionise models, we use pyspark. If I pass the same hyperparameters and the same seed value, i.e. 1234, will I get the same results?

Basically, do random seed numbers work between different systems?

Kuopio answered 12/9, 2018 at 11:17 Comment(1)

You can write a test case for that. Input the seed, generate 100 random numbers, and check whether they are the expected ones. – Hammerless 12/9, 2018 at 12:14
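A minimal sketch of the kind of test the comment describes, using only the standard library. The `lcg_first_n` function is a hypothetical stand-in for "another system": it is a textbook linear congruential generator, not what Spark or NumPy actually use, and is there only to show what a mismatch looks like.

```python
import random


def first_n(seed, n=100):
    """First n uniform draws from Python's own generator for a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def lcg_first_n(seed, n=100):
    """Hypothetical 'other system': a textbook LCG seeded with the same value."""
    m, a, c = 2**32, 1664525, 1013904223
    state = seed
    out = []
    for _ in range(n):
        state = (a * state + c) % m
        out.append(state / m)
    return out


expected = first_n(1234)

# Same algorithm, same seed: the stream is reproduced exactly.
same_system_matches = first_n(1234) == expected

# Different algorithm, same seed: the stream is unrelated.
other_system_matches = lcg_first_n(1234) == expected

print(same_system_matches, other_system_matches)
```

The seed only selects a starting state for one particular algorithm, so the check passes only when both sides run the same generator.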

Well, this is exactly the kind of question that could really do with some experiments & code snippets provided...

Anyway, it seems that the general answer is a firm no: not only between Python and Spark MLlib, but even between Spark sub-modules, or between Python & Numpy...

Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported & the relevant contexts initialized):

import sys

import random
import pandas as pd
import numpy as np
from pyspark.sql.functions import rand, randn
from pyspark.mllib import random as r  # avoid conflict with native Python random module

print("Spark version " + spark.version)
print("Python version %s.%s.%s" % sys.version_info[:3])
print("Numpy version " + np.version.version)

# Spark version 2.3.1 
# Python version 3.5.2 
# Numpy version 1.11.1

s = 1234 # RNG seed


# Spark SQL random module:
spark_df = sqlContext.range(0, 10)
spark_df = spark_df.select("id", randn(seed=s).alias("normal"), rand(seed=s).alias("uniform"))


# Python 3 random module:
random.seed(s)
x = [random.uniform(0,1) for i in range(10)] # random.random() gives exact same results

random.seed(s)
y = [random.normalvariate(0,1) for i in range(10)]

df = pd.DataFrame({'uniform':x, 'normal':y})


# numpy random module
np.random.seed(s)
xx = np.random.uniform(size=10)  # again, np.random.rand(10) gives exact same results

np.random.seed(s)
yy = np.random.randn(10)

numpy_df = pd.DataFrame({'uniform':xx, 'normal':yy})


# Spark MLlib random module
rdd_uniform = r.RandomRDDs.uniformRDD(sc, 10, seed=s).collect()
rdd_normal = r.RandomRDDs.normalRDD(sc, 10, seed=s).collect()

rdd_df = pd.DataFrame({'uniform':rdd_uniform, 'normal':rdd_normal})

And here are the results:

Native Python 3:

# df

     normal  uniform
0  1.430825 0.966454
1  1.803801 0.440733 
2  0.321290 0.007491 
3  0.599006 0.910976 
4 -0.700891 0.939269 
5  0.233350 0.582228
6 -0.613906 0.671563
7 -1.622382 0.083938
8  0.131975 0.766481
9  0.191054 0.236810

Numpy:

# numpy_df

     normal  uniform
0  0.471435 0.191519
1 -1.190976 0.622109 
2  1.432707 0.437728
3 -0.312652 0.785359
4 -0.720589 0.779976
5  0.887163 0.272593
6  0.859588 0.276464 
7 -0.636524 0.801872 
8  0.015696 0.958139
9 -2.242685 0.875933

Spark SQL:

# spark_df.show()

+---+--------------------+-------------------+ 
| id|              normal|            uniform|
+---+--------------------+-------------------+
|  0|  0.9707422835368164| 0.9499610869333489| 
|  1|  0.3641589200870126| 0.9682554532421536|
|  2|-0.22282955491417034|0.20293463923130883|
|  3|-0.00607734375219...|0.49540111648680385|
|  4|  -0.603246393509015|0.04350782074761239|
|  5|-0.12066287904491797|0.09390549680302918|
|  6|  0.2899567922101867| 0.6789838400775526|
|  7|  0.5827830892516723| 0.6560703836291193|
|  8|   1.351649207673346| 0.7750229279150739|
|  9|  0.5286035772104091| 0.6075560897646175|
+---+--------------------+-------------------+

Spark MLlib:

# rdd_df

     normal  uniform 
0 -0.957840 0.259282 
1  0.742598 0.674052 
2  0.225768 0.707127 
3  1.109644 0.850683 
4 -0.269745 0.414752 
5 -0.148916 0.494394 
6  0.172857 0.724337
7 -0.276485 0.252977
8 -0.963518 0.356758
9  1.366452 0.703145

Of course, even if the above results were identical, this would be no guarantee that results from, say, Random Forest in scikit-learn, would be exactly identical to the results of pyspark Random Forest...

Despite the negative answer, I really cannot see how that affects the deployment of any ML system, i.e. if the results depend crucially on the RNG, then something is definitely not right...
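To illustrate that last point, here is a toy sketch of a seed-sensitivity check. It is not a random forest: `bagged_mean` is a hypothetical stand-in that averages bootstrap estimates, chosen only because it uses randomness in a similar way (resampling) and runs with the standard library alone. The data set and the list of seeds are arbitrary assumptions.

```python
import random


def make_data(n=2000):
    """Fixed synthetic data set (generated once, independent of the model seed)."""
    rng = random.Random(0)
    return [rng.gauss(5.0, 2.0) for _ in range(n)]


def bagged_mean(data, seed, n_rounds=50):
    """Average of bootstrap-sample means; the seed only drives the resampling."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rounds):
        sample = rng.choices(data, k=len(data))
        estimates.append(sum(sample) / len(sample))
    return sum(estimates) / len(estimates)


data = make_data()
seeds = (1, 42, 1234, 2018, 99999)
results = [bagged_mean(data, seed) for seed in seeds]

# The individual numbers differ from seed to seed, but only slightly.
spread = max(results) - min(results)
print(results)
print("spread across seeds:", spread)
```

If the spread across seeds were comparable to the quantity being estimated, that would point to a problem with the model or the data rather than with the choice of seed.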

Progression answered 13/9, 2018 at 10:11 Comment(0)

In the old days, portability of PRNGs was not a given. Differences in machine architecture, overflow handling, and the implementation of both the algorithm and the language it was written in meant that results could and did vary, even if they were nominally based on the same mathematical formulation. In 1979 Schrage (see page 1194 here) created a portable prime-modulus multiplicative linear congruential generator and showed that it could be implemented in a machine- and language-independent way "...as long as the machine can represent all integers in the interval -2³¹ to 2³¹ - 1." He gave a specific check that implementers could use to test their implementation, specifying what the 1000th outcome should be given a particular seed value. Since Schrage's work, designing algorithms to be platform and language independent has become the norm.
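A minimal sketch of the idea, assuming the multiplier 16807 and modulus 2³¹ - 1 usually associated with this generator. The decomposition keeps every intermediate value inside the 32-bit signed range, which is what makes it portable. Note that the check value used here is the 10,000th value from seed 1 published for the same generator by Park and Miller, not the 1000th-value check mentioned above.

```python
A = 16807            # multiplier (7**5), assumed here
M = 2**31 - 1        # prime modulus
Q = M // A           # 127773
R = M % A            # 2836


def schrage_next(x):
    """Compute (A * x) % M without any intermediate exceeding 32-bit signed range."""
    hi, lo = divmod(x, Q)
    t = A * lo - R * hi      # both products stay below 2**31
    return t if t > 0 else t + M


def nth_value(seed, n):
    """Value of the generator after n steps from the given seed."""
    x = seed
    for _ in range(n):
        x = schrage_next(x)
    return x


# Cross-check against direct big-integer arithmetic, which Python supports natively.
x = 1234
for _ in range(1000):
    assert schrage_next(x) == (A * x) % M
    x = schrage_next(x)

# Park and Miller's published check: seed 1, 10,000 steps.
check = nth_value(1, 10000)
print(check == 1043618065)
```

A published check value like this is what lets two independent implementations, in any language, confirm that they produce the same stream.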

Python's default generator is a Mersenne twister, and a variety of platform and language independent MT implementations are available on the Mersenne Twister home page. If Python switches its default generator in the future, then compatibility is not guaranteed unless you use one of the independent Python implementations available from the link above.
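As a rough illustration, here is a pure-Python port of the reference MT19937 routines, compared against the built-in `random` module. Two assumptions to flag: the comparison relies on CPython implementation details (integer seeds are fed to the reference `init_by_array` routine as 32-bit words, and `getrandbits(32)` returns one raw generator output), and the class and method names are mine, not part of any library.

```python
import random

N, M = 624, 397
MASK32 = 0xFFFFFFFF


class MT19937:
    """Illustrative pure-Python port of the reference MT19937 routines."""

    def __init__(self):
        self.mt = [0] * N
        self.index = N

    def init_genrand(self, s):
        """Seed from a single 32-bit integer."""
        self.mt[0] = s & MASK32
        for i in range(1, N):
            prev = self.mt[i - 1]
            self.mt[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & MASK32
        self.index = N

    def init_by_array(self, key):
        """Seed from a list of 32-bit words."""
        self.init_genrand(19650218)
        mt, i, j = self.mt, 1, 0
        for _ in range(max(N, len(key))):
            prev = mt[i - 1]
            mt[i] = ((mt[i] ^ ((prev ^ (prev >> 30)) * 1664525)) + key[j] + j) & MASK32
            i, j = i + 1, (j + 1) % len(key)
            if i >= N:
                mt[0], i = mt[N - 1], 1
        for _ in range(N - 1):
            prev = mt[i - 1]
            mt[i] = ((mt[i] ^ ((prev ^ (prev >> 30)) * 1566083941)) - i) & MASK32
            i += 1
            if i >= N:
                mt[0], i = mt[N - 1], 1
        mt[0] = 0x80000000

    def next_uint32(self):
        """Return the next raw 32-bit output."""
        mt = self.mt
        if self.index >= N:
            # Regenerate the whole state block ("twist").
            for k in range(N):
                y = (mt[k] & 0x80000000) | (mt[(k + 1) % N] & 0x7FFFFFFF)
                mt[k] = mt[(k + M) % N] ^ (y >> 1) ^ (0x9908B0DF if y & 1 else 0)
            self.index = 0
        y = mt[self.index]
        self.index += 1
        # Tempering.
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & MASK32


# Single-integer seeding, as in the reference code's default seed.
ref = MT19937()
ref.init_genrand(5489)
first_output = ref.next_uint32()

# Array seeding with the word [1234], compared with random.Random(1234).
mine = MT19937()
mine.init_by_array([1234])
independent = [mine.next_uint32() for _ in range(5)]

builtin_rng = random.Random(1234)
builtin = [builtin_rng.getrandbits(32) for _ in range(5)]

print(independent == builtin)
```

The point is that matching streams require matching both the algorithm and the seeding procedure; the same integer passed to `init_genrand` and to `init_by_array` starts the generator in different states.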

Broom answered 12/9, 2018 at 14:3 Comment(0)
