Spark GroupBy agg collect_list multiple columns

I have a question similar to this one, but the number of columns collect_list should operate on is given by a list of names. For example:

scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
|  A|   D1|  T0|   P1|
|  A|   D0|  T1|   P2|
|  B|   Y1|  T0|   P3|
|  B|   Y2|  T2|   P3|
|  C|   H1|  T0|   P5|
|  C|   H0|  T9|   P5|
|  B|   Y0|  T1|   P2|
|  B|   H1|  T3|   P6|
|  D|   H1|  T2|   P4|
+---+-----+----+-----+


scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)

scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]

scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
|  B|   [Y1, Y2, Y0, H1]|  [T0, T2, T1, T3]|   [P3, P3, P2, P6]|
|  D|               [H1]|              [T2]|               [P4]|
|  C|           [H1, H0]|          [T0, T9]|           [P5, P5]|
|  A|           [D1, D0]|          [T0, T1]|           [P1, P2]|
+---+-------------------+------------------+-------------------+

Is there any way to apply collect_list to multiple columns inside agg without knowing the number of elements in combList beforehand?
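
For reference, RelationalGroupedDataset.agg has a varargs overload, agg(expr: Column, exprs: Column*), so one way around the fixed arity is to build the expressions from the list first. A minimal sketch, assuming the w and combList defined above:

import org.apache.spark.sql.functions.{col, collect_list}

// One collect_list expression per name in combList; the list length
// no longer needs to be known when the code is written.
val aggExprs = combList.map(c => collect_list(col(c)).as(s"collect_list($c)"))

// agg takes (Column, Column*), so split the list into head and tail.
val v = w.groupBy("iid").agg(aggExprs.head, aggExprs.tail: _*)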

Commute answered 13/2, 2018 at 4:42
Hey Jonathan, did you figure this out? I'm facing a similar issue. (Sodden)
I'm afraid I did not find a solution, but please note that this approach may not scale well for a large amount of data. (Commute)
Thanks! I found a workaround using dicts, a for loop, and joins (sketched after these comments). It actually scales well for up to 2 billion rows and 30 columns. I'll keep you posted if I get around to publishing a library. (Sodden)
Maybe this could help: aggregating-multiple-columns-with-custom-function-in-spark (Hoashis)
Possible duplicate of Aggregating multiple columns with custom function in spark (Hoashis)
Please refer to this link for the solution; I've posted it there: <https://mcmap.net/q/1320072/-how-to-retrieve-all-columns-using-pyspark-collect_list-functions> (Cattier)
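
A rough Scala sketch of the join-based workaround described in the comments above: aggregate each column separately, then join the per-column results back on the grouping key (names follow the question's example; the dict bookkeeping from the Python version is not needed here):

import org.apache.spark.sql.functions.{col, collect_list}

// Aggregate one column at a time, then join the results on "iid".
val perColumn = combList.map { c =>
  w.groupBy("iid").agg(collect_list(col(c)).as(s"collect_list($c)"))
}
val joined = perColumn.reduce(_.join(_, Seq("iid")))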
Answer

You can use collect_list(struct(col1, col2)) AS elements.

Example:

df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema

// Register df as a temporary view so it can be queried with spark.sql.
df.createOrReplaceTempView("teste")

val outputDf = spark.sql("SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema

df
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- cd_item: string (nullable = true)
 |-- nm_item: string (nullable = true)

outputDf
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cd_item: string (nullable = true)
 |    |    |-- nm_item: string (nullable = true)
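
The same struct-based aggregation can be written with the DataFrame API, and struct also accepts a varargs list of columns, so it combines naturally with a runtime column list like the question's combList. A sketch under those assumptions:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Pack the listed columns into one struct per row, then collect the
// structs per group; this keeps the values of each row aligned.
val packed = w.groupBy("iid")
  .agg(collect_list(struct(combList.map(col): _*)).as("items"))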
Agnostic answered 9/1, 2020 at 14:40
I've been looking for a groupBy example with collect_list; this is the only one I have found. (Gerrilee)
