Add UUID to spark dataset [duplicate]
I am trying to add a UUID column to my dataset.

getDataset(Transaction.class).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);

But the result is that all the rows have the same UUID. How can I make it unique for each row?

+-----------------------------------+
|uniqueId                           |
+-----------------------------------+
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
+-----------------------------------+
Garry asked 9/4, 2018 at 14:57 Comment(2)
Check the link below: #37232116 – Suffice
No, I tried the solution in the link; it uses lit, which is not the right solution. – Garry
Updated (Apr 2021):

Per @ferdyh, there's a better way using the uuid() function from Spark SQL. Something like expr("uuid()") will use Spark's native UUID generator, which should be much faster and cleaner to implement.
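
For example, a minimal Java sketch of that suggestion, reusing getDataset(Transaction.class) from the question (assumes Spark 2.3+, where the SQL uuid() function is available):

import static org.apache.spark.sql.functions.expr;

// expr("uuid()") is evaluated per row, so every row gets its own UUID.
getDataset(Transaction.class)
        .withColumn("uniqueId", expr("uuid()"))
        .show(false);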

Originally (June 2018):

When you include the UUID as a lit column, you're doing the same as including a string literal: the value is generated once on the driver and repeated for every row.

A UUID needs to be generated for each row. You could do this with a UDF; however, this can cause problems because UDFs are expected to be deterministic, and relying on randomness inside them can cause issues when caching or recomputation happens.

Your best bet may be generating a column with the Spark function rand and using UUID.nameUUIDFromBytes to convert that to a UUID.
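
A minimal Java sketch of that idea (it assumes a SparkSession named sparkSession and reuses getDataset(Transaction.class) from the question; the toUuid name and the double-to-bytes conversion are illustrative choices, not a fixed recipe):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import java.nio.ByteBuffer;
import java.util.UUID;

// Deterministic UDF: the same input double always maps to the same UUID,
// so caching or recomputation reproduces the same value for a given row.
UDF1<Double, String> toUuid = d ->
        UUID.nameUUIDFromBytes(ByteBuffer.allocate(8).putDouble(d).array()).toString();
sparkSession.udf().register("toUuid", toUuid, DataTypes.StringType);

getDataset(Transaction.class)
        .withColumn("seed", rand())                           // per-row randomness
        .withColumn("uniqueId", callUDF("toUuid", col("seed")))
        .drop("seed")
        .show(false);

Note that nameUUIDFromBytes produces a name-based (type 3) UUID, so two rows that happen to draw the same seed would get the same UUID; if that matters, the seed can be made richer, e.g. by combining other columns.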

Originally, I had:

val uuid = udf(() => java.util.UUID.randomUUID().toString)
getDataset(Transaction.class).withColumn("uniqueId", uuid()).show(false);

which @irbull pointed out could be an issue.

Belie answered 9/4, 2018 at 15:3 Comment(9)
Thanks a lot Benjamin. This solution is working. In Java, creating a UDF is a bit more tedious; it needs to be created and registered like below: static UDF1 uniqueId = types -> UUID.randomUUID().toString(); sparkSession.udf().register("uId", uniqueId, DataTypes.StringType); – Garry
There are two problems with this solution. 1. UUID.randomUUID() is not guaranteed to be unique across nodes. It uses a pseudo-random number, which is fine on a single machine, but in a cluster environment you could get collisions. 2. UDFs should be deterministic. That is, for the same input you get the same output (Spark reserves the right to cache and reuse results, or to call the same method multiple times if it chooses). #42961420 – Ariel
Great point @Ariel - I'll update to reflect. – Belie
@Ariel then what would be a good way to generate new unique ids when appending rows to a dataframe? monotonically_increasing_id + last stored monotonically_increasing_id? – Mumford
This generates a unique id for each row: csv.withColumn("uuid", monotonically_increasing_id() + monotonically_increasing_id) – Nugatory
Just to be sure: using monotonically_increasing_id with custom logic to add the last max value will not work, as the monotonically increasing id is calculated from the partition and row number, so the same dataframe can have values from 0 to 100 and from 854654 to 854659. – Delastre
Interestingly enough, Spark SQL does have support for this: issues.apache.org/jira/browse/SPARK-23599, in a way that is deterministic between retries. – Saied
This is a really slow solution, and there is a much simpler one: Spark has its own uuid function. UDFs are bad for your Spark code; they are the golden hammer of Spark coding. – Pronghorn
The uuid function isn't in the Scala API yet, so you'd have to do something like expr("uuid()"). – Jelle
