I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions - the same random value for each row id. I can reproduce the results by using
from pyspark.sql.functions import rand
new_df = my_df.withColumn("rand_index", rand(seed=7))
but it only works while I am running it in the same Spark session. I do not get the same results once I relaunch Spark and rerun my script.
I also tried defining a UDF, testing whether I could generate random integers within an interval using Python's random module with random.seed set:
import random
from pyspark.sql.types import LongType

random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())
but to no avail.
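One likely reason the UDF attempt fails (my reading, not confirmed in the thread): random.seed(7) runs once on the driver, while the lambda executes on executors whose generator state is not controlled, and Spark may re-evaluate Python UDFs since it treats them as non-deterministic. A sketch that sidesteps this by seeding a fresh generator from the row id itself, so the value depends only on the id and not on session or execution order (function and column names below are my own, not from the question):

```python
import random

def rand_for_id(row_id, low, high, seed=7):
    """Return a reproducible pseudo-random int in [low, high] for a row id.

    A generator seeded from (seed, row_id) always yields the same first
    draw, in any process or Spark session. String seeds are hashed with
    SHA-512 internally, so the result does not depend on PYTHONHASHSEED.
    """
    return random.Random(f"{seed}:{row_id}").randint(low, high)

# Hypothetical Spark wiring (assumes a `spark` session and an `id` column):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import LongType
#   get_rand = udf(lambda i: rand_for_id(i, 0, 100), LongType())
#   new_df = my_df.withColumn("rand_index", get_rand(col("id")))
```

Because the value is a pure function of the id, repartitioning or reordering the dataframe no longer changes the result.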
Is there a way to ensure reproducible random number generation across Spark sessions, such that a row id gets the same random value each time? I would really appreciate some guidance :) Thanks for the help!
Have you tried sorting by the id column (assumption - I hope the id column is unique in nature) and then inserting the random value column? Something like df.orderBy(df.id.desc()).withColumn("rand_index", rand(seed=7))? – Anomalous

So you are suggesting using the id to control the data distribution and thereby ensure a row/id gets the same random value assigned each time - do I get it right? – Andryc
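Sorting before applying a seeded rand can work because the seeded generator is deterministic per partition, so pinning the ordering and partitioning pins the values; it stays fragile, though, if the partition count changes between sessions. An alternative (my suggestion, not from the thread) is to make the value a pure function of the id via a stable hash, so no ordering is needed at all. A minimal pure-Python sketch of the idea, usable as a UDF body (names are hypothetical):

```python
import hashlib

def stable_uniform(row_id, seed=7):
    """Map (seed, row_id) to a reproducible float in [0, 1).

    Unlike Python's built-in hash(), sha256 is stable across processes
    and sessions, so the same id always maps to the same value.
    """
    digest = hashlib.sha256(f"{seed}:{row_id}".encode()).digest()
    # Take the first 8 bytes as an unsigned 64-bit int, scale to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64

# Hypothetical Spark wiring (assumes a `spark` session and an `id` column):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import DoubleType
#   new_df = my_df.withColumn(
#       "rand_index", udf(stable_uniform, DoubleType())(col("id")))
```

On Spark 3.x the same idea can stay in the JVM with built-in hash functions such as xxhash64, which avoids Python UDF serialization overhead.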