Does Kryo help in SparkSQL?

Kryo helps improve the performance of Spark applications through its efficient serialization approach.
I'm wondering whether Kryo will also help in the case of SparkSQL, and how I should use it.
In SparkSQL applications, we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static.
I'm not sure how to register one or several serializer classes for this use case.

For example:

case class Info(name: String, address: String)
...
import spark.implicits._ // needed for .toDF and the $"col" syntax below
val df = spark.sparkContext.textFile(args(0))
         .map(_.split(','))
         .filter(_.length >= 2)
         .map { e => Info(e(0), e(1)) }
         .toDF
df.select($"name") ...    // followed by subsequent analysis
df.select($"address") ... // followed by subsequent analysis

I don't think it's a good idea to define a case class for each select.
Or does it help if I register Info with registerKryoClasses(Array(classOf[Info]))?

Undine answered 14/3, 2018 at 6:15 Comment(3)
What is ds? If you have a Dataset[Info], then e => Info() is not necessary.Chapen
I corrected the piece of code.Undine
Possible duplicate of How to store custom objects in Dataset?Chapen

According to Spark's documentation, SparkSQL does not use Kryo or Java serialization.

Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.

Encoders are much more lightweight than Java or Kryo serialization, which is to be expected: it is a far more optimizable job to serialize, say, a Row of three longs and two ints than to serialize a class together with its version description and its inner variables... and then have to instantiate it.
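
To make this concrete, here is a minimal, spark-shell style sketch (reusing the question's Info case class; the session setup is illustrative) showing that a Dataset built from a case class keeps a columnar schema driven by its product encoder rather than a serialized blob:

import org.apache.spark.sql.{Encoders, SparkSession}

case class Info(name: String, address: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Info("alice", "street 1"), Info("bob", "street 2")).toDS()
ds.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- address: string (nullable = true)

// The encoder behind this is derived from the case class itself,
// not from Kryo or Java serialization:
println(Encoders.product[Info].schema)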

That being said, there is a way to use Kryo as an encoder implementation, see for example here: How to store custom objects in Dataset?. But this is meant as a solution to store custom objects (e.g. non-product classes) in a Dataset, and not specifically targeted at standard DataFrames.
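
As a rough sketch of that approach (the Foo class below is purely illustrative), a Kryo-backed encoder stores the whole object in a single binary column:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

class Foo(val name: String, val score: Double) // a non-product (non-case) class

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

implicit val fooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
val ds = Seq(new Foo("a", 1.0), new Foo("b", 2.0)).toDS()

ds.printSchema()
// root
//  |-- value: binary (nullable = true)
// The object is an opaque blob, so column pruning, predicate pushdown, etc.
// on its fields are lost, which is why this is a fallback rather than a speedup.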

Without Kryo or Java serializers, creating encoders for custom, non-product classes is somewhat limited (see the discussions on user-defined types), for example starting here: Does Apache spark 2.2 supports user-defined type (UDT)?

Raila answered 14/3, 2018 at 9:3 Comment(3)
Has this changed? I can create a DS or DF with implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] and get a binary value. That would imply that things have changed? Well, yes it has, per the link above.Tabithatablature
I think the answer stands (for now). Datasets rely on encoders; the default ones (strings, numbers, case and product classes...) are not object-serialization based, do not use Kryo, and are much faster for it. Kryo is available as an option if you need it - usually with a performance penalty (heavier on the CPU, on heap memory, and on dynamic execution optimization). People often refer to Kryo because back in the 1.x and early 2.x days, with the RDD API using Java serialization, Kryo was a boost. Yet the current Dataset(/frame) world is very different from that time.Raila
That is my conclusion as well. How to read the value etc.: Tungsten. ThxTabithatablature

You can set the serializer to Kryo by setting the spark.serializer property to org.apache.spark.serializer.KryoSerializer, either on your SparkConf or in a custom properties file you pass to the spark-submit command via the --properties-file flag.
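
A minimal sketch of that configuration (the registerKryoClasses call reuses the question's Info class and is optional):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Info])) // avoids writing full class names with each object

val spark = SparkSession.builder().config(conf).getOrCreate()

// equivalent line in a file passed via --properties-file:
// spark.serializer    org.apache.spark.serializer.KryoSerializer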

When you configure the Kryo serializer, Spark will transparently use Kryo when transmitting data between nodes, so your Spark SQL statements should automatically inherit the performance benefit.

Ancestor answered 14/3, 2018 at 6:29 Comment(3)
The problem is, when I register the class as conf.registerKryoClasses(Array(classOf[MyClass1])), how should I define the case class MyClass fields, given that the columns I'm interested in change from select to select?Undine
Can you provide a bit more detail about what MyClass is doing? I thought you just wanted to execute a SparkSQL statement.Ancestor
This is true when working with an RDD, but not when working with a DataFrame, where Kryo is not used (nor Java serialization).Raila
