Add UUID to spark dataset [duplicate]
I am trying to add a UUID column to my dataset.

getDataset(Transaction.class).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);

But the result is that all the rows have the same UUID. How can I make it unique for each row?

+-----------------------------------+
|uniqueId                           |
+-----------------------------------+
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
+-----------------------------------+
Garry asked 9/4, 2018 at 14:57 Comment(2)
Check the link below: #37232116 – Suffice
No, I tried the solution in the link; it uses lit, which is not the right solution. – Garry
Updated (Apr 2021):

Per @ferdyh, there's a better way using the uuid() function from Spark SQL. Something like expr("uuid()") will use Spark's native UUID generator, which should be much faster and cleaner to implement.
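
For example, a minimal Java sketch of that suggestion, reusing getDataset(Transaction.class) from the question (assumes Spark 2.3+, where the SQL uuid() function is available):

import static org.apache.spark.sql.functions.expr;

// expr("uuid()") is evaluated per row, so every row gets its own UUID.
getDataset(Transaction.class)
        .withColumn("uniqueId", expr("uuid()"))
        .show(false);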

Originally (June 2018):

When you include the UUID as a lit column, you're doing the same as including a string literal: the value is generated once on the driver and repeated for every row.

A UUID needs to be generated for each row. You could do this with a UDF; however, this can cause problems because UDFs are expected to be deterministic, and relying on randomness inside them can cause issues when caching or recomputation happens.

Your best bet may be generating a column with the Spark function rand and using UUID.nameUUIDFromBytes to convert that to a UUID.
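
A minimal Java sketch of that idea (it assumes a SparkSession named sparkSession and reuses getDataset(Transaction.class) from the question; the toUuid name and the double-to-bytes conversion are illustrative choices, not a fixed recipe):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import java.nio.ByteBuffer;
import java.util.UUID;

// Deterministic UDF: the same input double always maps to the same UUID,
// so caching or recomputation reproduces the same value for a given row.
UDF1<Double, String> toUuid = d ->
        UUID.nameUUIDFromBytes(ByteBuffer.allocate(8).putDouble(d).array()).toString();
sparkSession.udf().register("toUuid", toUuid, DataTypes.StringType);

getDataset(Transaction.class)
        .withColumn("seed", rand())                           // per-row randomness
        .withColumn("uniqueId", callUDF("toUuid", col("seed")))
        .drop("seed")
        .show(false);

Note that nameUUIDFromBytes produces a name-based (type 3) UUID, so two rows that happen to draw the same seed would get the same UUID; if that matters, the seed can be made richer, e.g. by combining other columns.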

Originally, I had:

val uuid = udf(() => java.util.UUID.randomUUID().toString)
getDataset(Transaction.class).withColumn("uniqueId", uuid()).show(false);

which @irbull pointed out could be an issue.

Belie answered 9/4, 2018 at 15:3 Comment(9)
Thanks a lot Benjamin. This solution is working. In Java, creating a UDF is a bit more tedious; it needs to be created and registered like below: static UDF1 uniqueId = types -> UUID.randomUUID().toString(); sparkSession.udf().register("uId", uniqueId, DataTypes.StringType); – Garry
There are two problems with this solution. 1. UUID.randomUUID() is not guaranteed to be unique across nodes. It uses a pseudo-random number, which is fine on a single machine, but in a cluster environment you could get collisions. 2. UDFs should be deterministic. That is, for the same input you get the same output (Spark reserves the right to cache and reuse results, or to call the same method multiple times if it chooses). #42961420 – Ariel
Great point @Ariel - I'll update to reflect. – Belie
@Ariel then what would be a good way to generate new unique ids when appending rows to a dataframe? monotonically_increasing_id + last stored monotonically_increasing_id? – Mumford
This generates a unique id for each row: csv.withColumn("uuid", monotonically_increasing_id() + monotonically_increasing_id) – Nugatory
Just to be sure: using monotonically_increasing_id with custom logic to add the last max value will not work, as the monotonically increasing id is calculated from the partition and row number, so the same dataframe can have values from 0 to 100 and from 854654 to 854659. – Delastre
Interestingly enough, Spark SQL does have support for this: issues.apache.org/jira/browse/SPARK-23599, in a way that is deterministic between retries. – Saied
This is a really slow solution, and there is a much simpler one: Spark has its own uuid function. UDFs are bad for your Spark code; they are the golden hammer of Spark coding. – Pronghorn
The uuid function isn't in the Scala API yet, so you'd have to do something like expr("uuid()"). – Jelle
