Kryo Serialization for Spark 2.x Dataset

Is Kryo serialization still required when working with the Dataset API?

Because Datasets use Encoders for serialization and deserialization:

  1. Does Kryo serialization even work for Datasets? (Provided the right config is passed to Spark and classes are properly registered; see the sketch after this list.)
  2. If it works, how much performance improvement would it provide? Thanks.
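For context, by "the right config" I mean something along these lines; this is only a sketch, and com.example.MyClass is a placeholder class name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kryo-config-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrationRequired", "true")
  .config("spark.kryo.classesToRegister", "com.example.MyClass")  // placeholder class
  .getOrCreate()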
Braeunig asked 24/6, 2017 at 8:36

You don't need to use Kryo for a Dataset if you have an Encoder in scope that can serialize the Dataset's type (such as an ExpressionEncoder or RowEncoder). Those encoders perform field-level serialization, so you can, for example, filter on a column within the Dataset without unpacking the whole object. Encoders have other optimizations, like "runtime code generation to build custom bytecode for serialization and deserialization," and can be many times faster than Kryo.
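As a minimal sketch of that (the SparkSession value spark and the Person case class here are assumptions for illustration), filtering on a column with the ExpressionEncoder Spark derives for case classes:

import spark.implicits._  // brings case-class Encoders and $-column syntax into scope

case class Person(name: String, age: Int)

val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

// The column filter can be evaluated against the encoded binary format
// without deserializing whole Person objects:
people.filter($"age" > 21).show()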

However, if you try to put a type in a Dataset and Spark can't find an Encoder for it, you'll get an error, either at compile time or at runtime (for example, when an unsupported type is nested inside a case class). Say you wanted to use the DoubleRBTreeSet from the fastutil library. In that situation you'd need to define an Encoder for it, and a quick fix is often to use Kryo:

import org.apache.spark.sql.{Encoder, Encoders}
import it.unimi.dsi.fastutil.doubles.DoubleRBTreeSet

implicit val rbTreeEncoder: Encoder[DoubleRBTreeSet] = Encoders.kryo[DoubleRBTreeSet]
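With that encoder in scope, the whole object is stored as a single binary blob, so column-level operations aren't available on it, but functional operations still work. A usage sketch, assuming an existing SparkSession named spark:

// Each DoubleRBTreeSet is Kryo-serialized into one binary column.
val sets = spark.createDataset(Seq(new DoubleRBTreeSet(Array(1.0, 2.0, 3.0))))

// map deserializes each blob back into a DoubleRBTreeSet; the result
// type needs its own Encoder, supplied here explicitly.
sets.map(_.size())(Encoders.scalaInt).show()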
Heave answered 7/9, 2018 at 21:01
