Aggregating multiple columns with custom function in Spark

I was wondering if there is a way to specify a custom aggregation function for Spark DataFrames over multiple columns.

I have a table of (name, item, price) records like this:

john | tomato | 1.99
john | carrot | 0.45
bill | apple  | 0.99
john | banana | 1.29
bill | taco   | 2.59

I would like to aggregate each item and its cost for each person into a list like this:

john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)

Is this possible with DataFrames? I recently learned about collect_list, but it appears to only work for one column.
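A minimal sketch of the single-column collect_list mentioned above (assuming the table is a DataFrame named df with columns name, item, price):

import org.apache.spark.sql.functions.collect_list

// Collects only the items per person; the price column is lost,
// which is the limitation being asked about.
df.groupBy("name").agg(collect_list("item").as("items")).show(false)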

Horlacher answered 9/6, 2016 at 23:38 Comment(0)

The easiest way to do this as a DataFrame is to first collect two lists, and then use a UDF to zip the two lists together. Something like:

import org.apache.spark.sql.functions.{col, collect_list, udf}
import sqlContext.implicits._

val zipper = udf[Seq[(String, Double)], Seq[String], Seq[Double]](_.zip(_))

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

val df2 = df.groupBy("name").agg(
  collect_list(col("food")) as "food",
  collect_list(col("price")) as "price" 
).withColumn("food", zipper(col("food"), col("price"))).drop("price")

df2.show(false)
+----+---------------------------------------------+
|name|food                                         |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+
Changeover answered 10/6, 2016 at 11:37 Comment(8)
I used col(...) instead of $"..." for a reason -- I find col(...) works with less work inside of things like class definitions.Changeover
Is there any way to realign the columns before zipping, for example telling zip to first add an element at the tail of one column and remove one from its head, and then zip them? That way you could get, for example, the next price for each item if you read prices daily and there is a time column.Ane
Not entirely sure what you are asking, but you can use DataFrame.select(...) to change the order of columns.Changeover
I meant like this question: stackoverflow.com/q/39274585/2525128. I used this answer a lot in my code, but I am trying to use the same method for time-series data and add the next occurrence of an event for a specific observation as a field; since I do not know exactly when that happens, it is a little hard to make work.Ane
The answer assumes (maybe correctly) that collect_list() will preserve the order of elements on the two columns food & price. Meaning that food and price from the same row will end up at the same index in the two collected lists. Is this order preserving behavior guaranteed? (it would make sense, but I'm not sure by looking at the scala code for collect_list, not a scala programmer).Andrel
Afaik, there is no guarantee that the order of elements will be the same. cf : #40408014Autoxidation
I used a variation of this solution to zip five lists together. This gave me the opportunity to write the best line of code of my career so far: _ zip _ zip _ zip _ zip _Omidyar
Note: The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle. spark.apache.org/docs/latest/api/python/…Proliferation
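Given the ordering concern raised in the comments above, a minimal sketch (assuming a Spark 2.x version where arrays of structs are orderable) that sidesteps it by pairing the columns per row and sorting the collected array:

import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct}

// Pair food and price within each row, then sort the collected array
// so the result no longer depends on row order after a shuffle.
val stable = df.groupBy("name")
  .agg(sort_array(collect_list(struct(col("food"), col("price")))).as("foods"))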

Consider using the struct function to group the columns together before collecting as a list:

import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
  .show(false)

Outputs:

+----+---------------------------------------------+
|name|foods                                        |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+
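If the individual fields are needed again downstream, the collected structs can be unpacked with explode; a sketch, not part of the original answer:

import org.apache.spark.sql.functions.{collect_list, explode, struct}

// Re-run the aggregation, then expand the array back into one row per
// (name, food, price) and pull the struct fields out by name.
val grouped = df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))

grouped
  .select($"name", explode($"foods").as("item"))
  .select($"name", $"item.food", $"item.price")
  .show(false)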
Corporeity answered 10/3, 2017 at 19:50 Comment(2)
I want to mention that this approach looks cleaner than the accepted answer, but unfortunately doesn't work with spark 1.6, because collect_list() doesn't accept a struct.Geodesic
Works in Spark 2.1Mendenhall

Maybe a better way than the zip UDF (since UDFs and UDAFs are bad for performance) is to wrap the two columns into a struct before collecting.

This would probably work as well:

import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._  // enables the 'symbol column syntax (assumes a SparkSession named spark)

df.select('name, struct('food, 'price).as("tuple"))
  .groupBy('name)
  .agg(collect_list('tuple).as("tuples"))
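The same aggregation can also be expressed in Spark SQL; a sketch, assuming Spark 2.x with a SparkSession named spark and the DataFrame registered as a temporary view:

df.createOrReplaceTempView("purchases")  // view name is arbitrary

spark.sql("""
  SELECT name, collect_list(struct(food, price)) AS foods
  FROM purchases
  GROUP BY name
""").show(false)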
Jackquelinejackrabbit answered 6/8, 2018 at 17:38 Comment(0)

To your point that collect_list appears to only work for one column: for collect_list to work on multiple columns, wrap the columns you want to aggregate in a struct. For example:

import org.apache.spark.sql.functions.{collect_list, struct}

val aggregatedData = df.groupBy("name").agg(collect_list(struct("item", "price")).as("foods"))

aggregatedData.show(false)
+----+------------------------------------------------+
|name|foods                                           |
+----+------------------------------------------------+
|john|[[tomato, 1.99], [carrot, 0.45], [banana, 1.29]]|
|bill|[[apple, 0.99], [taco, 2.59]]                   |
+----+------------------------------------------------+
Polyethylene answered 2/3, 2020 at 4:14 Comment(0)

Here is an option that converts the DataFrame to an RDD of key-value pairs and then calls groupByKey on it. The result is a list of key-value pairs where each value is a list of tuples.

df.show
+----+------+----+
|  _1|    _2|  _3|
+----+------+----+
|john|tomato|1.99|
|john|carrot|0.45|
|bill| apple|0.99|
|john|banana|1.29|
|bill|  taco|2.59|
+----+------+----+


val tuples = df.map(row => row(0) -> (row(1), row(2)))
tuples: org.apache.spark.rdd.RDD[(Any, (Any, Any))] = MapPartitionsRDD[102] at map at <console>:43

tuples.groupByKey().map{ case(x, y) => (x, y.toList) }.collect
res76: Array[(Any, List[(Any, Any)])] = Array((bill,List((apple,0.99), (taco,2.59))), (john,List((tomato,1.99), (carrot,0.45), (banana,1.29))))
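A variation with concrete types instead of Any, converting back to a DataFrame at the end; a sketch, assuming Spark 2.x with a SparkSession named spark and the column types shown above:

import spark.implicits._

// Extract typed fields, group by name, and turn the grouped pairs
// back into a DataFrame with an array-of-structs column.
val typed = df.rdd.map(row => (row.getString(0), (row.getString(1), row.getDouble(2))))

typed.groupByKey()
  .map { case (name, items) => (name, items.toSeq) }
  .toDF("name", "foods")
  .show(false)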
Ara answered 10/6, 2016 at 2:20 Comment(0)
