Does Kryo help in SparkSQL?

Kryo helps improve the performance of Spark applications through its efficient serialization approach.
I'm wondering whether Kryo will also help in the case of SparkSQL, and how I should use it.
In SparkSQL applications, we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static.
I'm not sure how to register one or several serializer classes for this use case.

For example:

case class Info(name: String, address: String)
...
import spark.implicits._ // needed for .toDF and the $"col" syntax below
val df = spark.sparkContext.textFile(args(0))
         .map(_.split(','))
         .filter(_.length >= 2)
         .map { e => Info(e(0), e(1)) }
         .toDF
df.select($"name") ...    // followed by subsequent analysis
df.select($"address") ... // followed by subsequent analysis

I don't think it's a good idea to define a case class for each select.
Or does it help if I register Info with registerKryoClasses(Array(classOf[Info]))?

Undine answered 14/3, 2018 at 6:15 Comment(3)
What is ds? If you have a Dataset[Info], then e => Info() is not necessary.Chapen
I corrected the piece of code.Undine
Possible duplicate of How to store custom objects in Dataset?Chapen

According to Spark's documentation, SparkSQL does not use Kryo or Java serialization.

Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.

Encoders are much more lightweight than Java or Kryo serialization, which is to be expected: it is a far more optimizable job to serialize, say, a Row of three longs and two ints than to serialize a class together with its version description and its inner variables... and then have to instantiate it.
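
To make this concrete, here is a minimal, spark-shell style sketch (reusing the question's Info case class; the session setup is illustrative) showing that a Dataset built from a case class keeps a columnar schema driven by its product encoder rather than a serialized blob:

import org.apache.spark.sql.{Encoders, SparkSession}

case class Info(name: String, address: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Info("alice", "street 1"), Info("bob", "street 2")).toDS()
ds.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- address: string (nullable = true)

// The encoder behind this is derived from the case class itself,
// not from Kryo or Java serialization:
println(Encoders.product[Info].schema)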

That being said, there is a way to use Kryo as an encoder implementation, see for example here: How to store custom objects in Dataset?. But this is meant as a solution to store custom objects (e.g. non-product classes) in a Dataset, and not specifically targeted at standard DataFrames.
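
As a rough sketch of that approach (the Foo class below is purely illustrative), a Kryo-backed encoder stores the whole object in a single binary column:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

class Foo(val name: String, val score: Double) // a non-product (non-case) class

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

implicit val fooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
val ds = Seq(new Foo("a", 1.0), new Foo("b", 2.0)).toDS()

ds.printSchema()
// root
//  |-- value: binary (nullable = true)
// The object is an opaque blob, so column pruning, predicate pushdown, etc.
// on its fields are lost, which is why this is a fallback rather than a speedup.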

Without Kryo or Java serializers, creating encoders for custom, non-product classes is somewhat limited (see the discussions on user-defined types), for example starting here: Does Apache spark 2.2 supports user-defined type (UDT)?

Raila answered 14/3, 2018 at 9:3 Comment(3)
Has this changed? I can create a DS or DF with implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] and get a binary value. That would imply that things have changed? Well, yes it has, per the link above.Tabithatablature
I think the answer stands (for now). Datasets rely on encoders; the default ones (strings, numbers, case and product classes...) are not object-serialization based, do not use Kryo, and are much faster for it. Kryo is available as an option if you need it - usually with a performance penalty (heavier on the CPU, on heap memory, and on dynamic execution optimization). People often refer to Kryo because back in the 1.x and early 2.x days, with the RDD API using Java serialization, Kryo was a boost. Yet the current Dataset(/frame) world is very different from that time.Raila
That is my conclusion as well. How to read the value etc.: Tungsten. ThxTabithatablature

You can set the serializer to Kryo by setting the spark.serializer property to org.apache.spark.serializer.KryoSerializer, either on your SparkConf or in a custom properties file you pass to the spark-submit command via the --properties-file flag.
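
A minimal sketch of that configuration (the registerKryoClasses call reuses the question's Info class and is optional):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Info])) // avoids writing full class names with each object

val spark = SparkSession.builder().config(conf).getOrCreate()

// equivalent line in a file passed via --properties-file:
// spark.serializer    org.apache.spark.serializer.KryoSerializer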

When you configure the Kryo serializer, Spark will transparently use Kryo when transmitting data between nodes, so your Spark SQL statements should automatically inherit the performance benefit.

Ancestor answered 14/3, 2018 at 6:29 Comment(3)
The problem is, when I register the class as conf.registerKryoClasses(Array(classOf[MyClass1])), how should I define the case class MyClass fields, given that the columns I'm interested in change from select to select?Undine
Can you provide a bit more detail about what MyClass is doing? I thought you just wanted to execute a SparkSQL statement.Ancestor
This is true when working with an RDD, but not when working with a DataFrame, where Kryo is not used (nor Java serialization).Raila
