When to use Kryo serialization in Spark?
I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using Kryo serialization make the program even more efficient, or is it not useful in this case? I know that Kryo is for sending the data between the nodes in a more efficient way. But if the communicated data is already compressed, is it even needed?

Assyria answered 26/10, 2016 at 12:13 Comment(1)
As I understand it, Spark compresses the byte array produced by the serialization mechanism (after serialization occurs), which makes communication faster. But it doesn't improve the speed of serialization itself, since that still uses the standard Java serializer by default. Embowed
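To make the comment above concrete, here's a toy sketch of the two stages using plain Java serialization plus GZIP. This is only an illustration of the idea that compression operates on already-serialized bytes, not Spark's actual code path:

```java
import java.io.*;
import java.util.*;
import java.util.zip.GZIPOutputStream;

public class SerializeThenCompress {
    public static void main(String[] args) throws IOException {
        // A toy dataset standing in for an RDD partition.
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) records.add("record-" + i);

        // Step 1: serialization turns objects into a byte array
        // (in Spark, the configured serializer does this).
        ByteArrayOutputStream serialized = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(serialized)) {
            oos.writeObject(records);
        }

        // Step 2: compression is applied to the already-serialized bytes
        // (this is the stage that spark.rdd.compress enables).
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(serialized.toByteArray());
        }

        System.out.println("serialized bytes: " + serialized.size());
        System.out.println("compressed bytes: " + compressed.size());
    }
}
```

Swapping in a faster serializer (like Kryo) speeds up step 1; compression only shrinks its output in step 2.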
Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that serialization is also used for shuffles (sending data between nodes): any time data needs to leave a JVM, whether it's going to local disk or through the network, it needs to be serialized.

Kryo is a significantly optimized serializer, and it performs better than the standard Java serializer for just about everything. In your case, you may actually be using Kryo already. You can check the Spark configuration parameter:

"spark.serializer" should be "org.apache.spark.serializer.KryoSerializer".

If it's not, then you can set it programmatically with:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
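The same setting can also be made outside the code, e.g. in spark-defaults.conf (this is the standard Spark configuration key):

```
spark.serializer    org.apache.spark.serializer.KryoSerializer
```

or on the command line with `spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer`.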

Regarding your last question ("is it even needed?"), it's hard to make a general claim. Kryo optimizes one of the slower steps in moving data around, but it's entirely possible that in your use case something else is the bottleneck. Either way, there's no downside to trying Kryo and benchmarking the difference!
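As a starting point for such a benchmark, here's a toy timing harness. It measures plain Java serialization only; comparing against Kryo would require adding the third-party Kryo dependency, which isn't shown here, and a serious benchmark would use a harness like JMH rather than manual timing:

```java
import java.io.*;
import java.util.*;

public class SerializerBench {
    // Serialize the given object once and return elapsed nanoseconds.
    static long timeJavaSerialization(Object payload) throws IOException {
        long start = System.nanoTime();
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(payload);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        List<Integer> payload = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) payload.add(i);

        // Warm up once, then measure.
        timeJavaSerialization(payload);
        long elapsed = timeJavaSerialization(payload);
        System.out.println("java serialization took " + elapsed / 1_000 + " microseconds");
    }
}
```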

Plummet answered 28/10, 2016 at 12:55 Comment(2)
I receive an error: not found: value conf. How does one solve it? Intransitive
conf is an instance of the SparkConf class. Alfano
Kryo is a more optimized serialization technique, and you can use it to serialize any class that appears in an RDD or DataFrame closure. Some specific cases where Kryo serialization helps:

  1. When you need to serialize third-party classes that don't implement java.io.Serializable inside an RDD or DataFrame closure
  2. When you want a more efficient serialization technique in general
  3. If you get a serialization error caused by some class, you can register that class with the Kryo serializer
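For point 3, the registration can also be done through configuration rather than code. These are standard Spark keys; com.example.MyClass is a placeholder for your own class:

```
spark.kryo.classesToRegister    com.example.MyClass
spark.kryo.registrationRequired true
```

With `spark.kryo.registrationRequired` set to `true`, Spark fails fast when it encounters an unregistered class instead of silently writing full class names (which wastes space).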
Uruguay answered 26/10, 2016 at 12:21 Comment(0)
Another point to consider: Kryo is faster than the default serializer at both serialization and deserialization, so it's generally better to use it. But the performance gain may not be as large as advertised; other factors influence program speed, such as how you write your Spark code and which libraries you choose.

Dermatologist answered 26/10, 2016 at 13:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.