Difference between a map and udf

Asked 19/8, 2016 at 12:28 Answered 19/8, 2016 at 13:45

When I work with DataFrames in Spark, I sometimes have to edit only the values of a particular column. For example, if I have a count field in my DataFrame and I would like to add 1 to every value of count, I could either write a custom udf and apply it with the withColumn feature of DataFrames, or I could do a map on the DataFrame and then build another DataFrame from the resulting RDD.

What I would like to know is how a udf actually works under the hood. How do map and udf compare in this case, and what is the performance difference?

Thanks!

Hawkeyed asked 19/8, 2016 at 12:28 Comment(1)

https://mcmap.net/q/977559/-performance-impact-of-rdd-api-vs-udfs-mixed-with-dataframe-api/1560062 – Carn 19/8, 2016 at 13:44

Simply put, map is more flexible than udf. With map, there is no restriction on the number of columns you can manipulate within a row. Say you want to derive the values for 5 columns of the data and delete 3 columns. You would need to call withColumn/udf 5 times, then do a select. With a single map function, you could do all of this.
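To make the shape of that difference concrete, here is a minimal sketch in plain Python. It does not use Spark at all: rows are modeled as dicts, and `with_column`, `select`, and `map_rows` are hypothetical stand-ins written for this illustration, not Spark APIs. The column names and the `is_big` rule are also made up for the example.

```python
# Plain-Python model of the two styles; no Spark involved.
# Rows are dicts. with_column and select are hypothetical stand-ins
# for the roles played by DataFrame.withColumn and DataFrame.select.

def with_column(rows, name, fn, source):
    """Column-wise style: fn sees one column value and yields one new value."""
    return [{**row, name: fn(row[source])} for row in rows]

def select(rows, names):
    """Keep only the named columns."""
    return [{name: row[name] for name in names} for row in rows]

def map_rows(rows, fn):
    """Row-wise style: fn sees the whole row and returns a whole new row."""
    return [fn(row) for row in rows]

rows = [
    {"id": 1, "count": 10, "tmp": "x"},
    {"id": 2, "count": 20, "tmp": "y"},
]

# Column-wise: one step per derived column, then a select to drop "tmp".
step1 = with_column(rows, "count", lambda c: c + 1, "count")
step2 = with_column(step1, "is_big", lambda c: c > 15, "count")
column_wise = select(step2, ["id", "count", "is_big"])

# Row-wise: a single function derives and drops columns in one pass.
def reshape(row):
    new_count = row["count"] + 1
    return {"id": row["id"], "count": new_count, "is_big": new_count > 15}

row_wise = map_rows(rows, reshape)

# Both styles produce the same rows; they differ in how the work is split up.
assert column_wise == row_wise
```

The sketch only shows how the work is organized. It says nothing about performance, which in Spark depends on how the DataFrame and RDD layers execute each style.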

Ardys answered 19/8, 2016 at 13:45 Comment(2)

If you're only processing one column, is it more efficient to use withColumn/udf than map? – Pastry 19/8, 2016 at 18:31

In general, creating a DataFrame from an RDD is going to have some overhead, so withColumn/udf should be more efficient. For more details, zero323's response here might be helpful: https://mcmap.net/q/977559/-performance-impact-of-rdd-api-vs-udfs-mixed-with-dataframe-api/1560062 – Ardys 19/8, 2016 at 18:35
