Kryo helps improve the performance of Spark applications through its efficient serialization approach.
I'm wondering whether Kryo will also help in the case of Spark SQL, and how I should use it.
In Spark SQL applications we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static.
I'm not sure how to register one or several serializer classes for this use case.
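For reference, this is how I would normally enable Kryo and register classes for an RDD-based job (the app name is a placeholder, and Info is the case class defined below); whether this setup actually benefits DataFrame operations is exactly what I'm unsure about:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Standard application-level Kryo setup: switch the serializer and register
// the classes that get shuffled or cached.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Info]))

val spark = SparkSession.builder()
  .appName("kryo-sql-example")   // placeholder app name
  .config(conf)
  .getOrCreate()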
For example:
case class Info(name: String, address: String)
...
import spark.implicits._   // needed for .toDF and the $"..." column syntax

val df = spark.sparkContext.textFile(args(0))
  .map(_.split(','))
  .filter(_.length >= 2)
  .map { e => Info(e(0), e(1)) }
  .toDF()
df.select($"name") ... // followed by subsequent analysis
df.select($"address") ... // followed by subsequent analysis
I don't think it's a good idea to define a case class for each select.
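For a single selected column I can get a typed result without a new case class, assuming spark.implicits._ is in scope, for example:

import org.apache.spark.sql.Dataset

// Typed single-column projections without defining a case class per select.
val names: Dataset[String] = df.select($"name").as[String]
val addresses: Dataset[String] = df.select($"address").as[String]

but that still doesn't tell me whether or where Kryo comes in.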
Or does it help if I register Info, like registerKryoClasses(Array(classOf[Info]))?
ds? If you have a Dataset[Info], then e => Info() is not necessary. – Chapen
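To make the Dataset angle from the comment concrete, here is a rough sketch reusing args(0) and Info from above; Encoders.kryo and createDataset are standard Spark APIs, and as far as I understand the Kryo encoder stores each Info as a single binary column, so per-field selects like $"name" would no longer apply:

import org.apache.spark.sql.Encoders
import spark.implicits._

// Build a typed Dataset[Info] from the same RDD pipeline as above.
val infoRdd = spark.sparkContext.textFile(args(0))
  .map(_.split(','))
  .filter(_.length >= 2)
  .map { e => Info(e(0), e(1)) }

val ds = infoRdd.toDS()          // default case-class (product) encoder
ds.select($"name")               // per-field column pruning still works here

// Alternative: an explicit Kryo encoder; each Info becomes one opaque binary value.
val kryoDs = spark.createDataset(infoRdd)(Encoders.kryo[Info])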