Kryo Serialization for Spark 2.x Dataset

Is Kryo serialization still required when working with the Dataset API?

Because Datasets use Encoders for serialization and deserialization:

  1. Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark and classes are properly registered; see the sketch after this list.)
  2. If it works, how much performance improvement would it provide? Thanks.
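For context, by "the right config" I mean something along these lines; this is only a sketch, and com.example.MyClass is a placeholder class name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kryo-config-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrationRequired", "true")
  .config("spark.kryo.classesToRegister", "com.example.MyClass")  // placeholder class
  .getOrCreate()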
Braeunig asked 24/6, 2017 at 8:36

You don't need to use Kryo for a Dataset if you have an Encoder in scope that can serialize the Dataset's type (such as an ExpressionEncoder or RowEncoder). Those encoders perform field-level serialization, so you can, for example, filter on a column within the Dataset without unpacking the whole object. Encoders have other optimizations, like "runtime code generation to build custom bytecode for serialization and deserialization," and can be many times faster than Kryo.
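As a minimal sketch of that (the SparkSession value spark and the Person case class here are assumptions for illustration), filtering on a column with the ExpressionEncoder Spark derives for case classes:

import spark.implicits._  // brings case-class Encoders and $-column syntax into scope

case class Person(name: String, age: Int)

val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

// The column filter can be evaluated against the encoded binary format
// without deserializing whole Person objects:
people.filter($"age" > 21).show()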

However, if you try to put a type in a Dataset and Spark can't find an Encoder for it, you'll get an error, either at compile time or at runtime (for example, when an unsupported type is nested inside a case class). Say you wanted to use the DoubleRBTreeSet from the fastutil library. In that situation you'd need to define an Encoder for it, and a quick fix is often to use Kryo:

import org.apache.spark.sql.{Encoder, Encoders}
import it.unimi.dsi.fastutil.doubles.DoubleRBTreeSet

implicit val rbTreeEncoder: Encoder[DoubleRBTreeSet] = Encoders.kryo[DoubleRBTreeSet]
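With that encoder in scope, the whole object is stored as a single binary blob, so column-level operations aren't available on it, but functional operations still work. A usage sketch, assuming an existing SparkSession named spark:

// Each DoubleRBTreeSet is Kryo-serialized into one binary column.
val sets = spark.createDataset(Seq(new DoubleRBTreeSet(Array(1.0, 2.0, 3.0))))

// map deserializes each blob back into a DoubleRBTreeSet; the result
// type needs its own Encoder, supplied here explicitly.
sets.map(_.size())(Encoders.scalaInt).show()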
Heave answered 7/9, 2018 at 21:01
