Given the following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SortQuestion extends App {

  val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
  import spark.implicits._

  case class ABC(a: Int, b: Int, c: Int)

  val first = Seq(
    ABC(1, 2, 3),
    ABC(1, 3, 4),
    ABC(2, 4, 5),
    ABC(2, 5, 6)
  ).toDF("a", "b", "c")

  // The tuple column becomes a struct with fields _1 (date) and _2 (int)
  val second = Seq(
    (1, 2, (Date.valueOf("2018-01-02"), 30)),
    (1, 3, (Date.valueOf("2018-01-01"), 20)),
    (2, 4, (Date.valueOf("2018-01-02"), 50)),
    (2, 5, (Date.valueOf("2018-01-01"), 60))
  ).toDF("a", "b", "c")

  first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
    .groupBy("a")
    .agg(sort_array(collect_list("c2")))
    .show(false)
}
Spark produces the following result:
+---+----------------------------------+
|a |sort_array(collect_list(c2), true)|
+---+----------------------------------+
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
+---+----------------------------------+
This implies that Spark is sorting the array by date (since the date is the first field of the struct), but I want to instruct Spark to sort by a specific field from that nested struct.
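(To double-check the field order, the schema can be inspected; toDF turns the tuple column into a struct whose fields are named _1 and _2, with the date first:)

second.printSchema()
// c: struct<_1: date, _2: int> -- _1 (the date) is the leading field,
// so sort_array's field-by-field struct comparison sorts by it first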
I know I can reshape the array to (value, date), as in the sketch below, but that seems inconvenient; I want a general solution (imagine I have a big nested struct, five layers deep, and I want to sort that structure by a particular column).
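For concreteness, the reshape I mean is just rebuilding the struct so the desired sort key comes first. A minimal sketch, reusing the imports and DataFrames above and assuming the struct fields are named _1 and _2 by toDF:

first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  // rebuild the struct with the value first, so sort_array's
  // left-to-right field comparison uses it as the primary key
  .withColumn("c2", struct($"c2._2".as("value"), $"c2._1".as("date")))
  .groupBy("a")
  .agg(sort_array(collect_list("c2")))
  .show(false)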
Is there a way to do that? Am I missing something?
Array[Seq[(Date, Int)]], but it may not fit on one machine (due to the large DataFrame) – Dace