How to store nested custom objects in Spark Dataset?

This question is a follow-up to How to store custom objects in Dataset?

Spark version: 3.0.1

Encoding a non-nested custom type works:

import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}

class AnObj(val a: Int, val b: String)

implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj] 

val d = spark.createDataset(Seq(new AnObj(1, "a")))

d.printSchema
root
 |-- value: binary (nullable = true)
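
A quick sanity check (a minimal sketch, assuming the spark session and the dataset d from above): the schema is an opaque binary column, but the objects still deserialize back intact:

// collect() deserializes the Kryo-encoded binary back into AnObj instances
d.collect().foreach(o => println(s"a=${o.a}, b=${o.b}"))
// prints: a=1, b=a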

However, if the custom type is nested inside a product type (i.e. a case class), it gives an error:

java.lang.UnsupportedOperationException: No Encoder found for InnerObj

import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}

class InnerObj(val a: Int, val b: String)
case class MyObj(i: Int, j: InnerObj)

implicit val myEncoder: Encoder[InnerObj] = Encoders.kryo[InnerObj] 

// runtime error: java.lang.UnsupportedOperationException: No Encoder found for InnerObj
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))

How can we create a Dataset with a nested custom type?

Ingroup answered 3/10, 2020 at 23:59

Adding the encoders for both MyObj and InnerObj should make it work.

  import spark.implicits._
  import org.apache.spark.sql.{Encoder, Encoders}

  class InnerObj(val a: Int, val b: String)
  case class MyObj(i: Int, j: InnerObj)

  implicit val myEncoder: Encoder[InnerObj] = Encoders.kryo[InnerObj]
  implicit val objEncoder: Encoder[MyObj] = Encoders.kryo[MyObj]

The above snippet compiles and runs fine.
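
For completeness, a minimal usage sketch (assuming the same SparkSession as in the question):

  // both encoders are now in scope, so the nested type no longer fails
  val d = spark.createDataset(Seq(MyObj(1, new InnerObj(0, "a"))))
  d.printSchema
  // root
  //  |-- value: binary (nullable = true)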

Otorhinolaryngology answered 4/10, 2020 at 6:21
That's embarrassing! I thought that with import spark.implicits._ and a case class, I wouldn't need to create an encoder for it. But it seems that as long as there's a nested custom type, I still need to create one for the case class. Thanks for the answer. – Ingroup

Another solution, apart from sujesh's:

import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}

class InnerObj(val a: Int, val b: String)
case class MyObj[T](i: Int, j: T)

implicit val myEncoder: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]] 

// works
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))

This also shows the difference between the case where the inner type can be deduced from a type parameter and the case where it cannot.

In the former case, a single encoder for the fully applied type is enough:

implicit val myEncoder: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]]

In the latter case, each type needs its own encoder:

// with the non-generic definitions from sujesh's answer
implicit val myEncoder1: Encoder[InnerObj] = Encoders.kryo[InnerObj]
implicit val myEncoder2: Encoder[MyObj] = Encoders.kryo[MyObj]
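
To make the former (generic) case concrete, here is a minimal sketch with a second inner type, OtherObj, which is hypothetical and not from the question. Each instantiation of MyObj[T] gets its own encoder, and the compiler selects the right one from the type parameter:

// hypothetical second inner type, for illustration only
class OtherObj(val x: Double)

implicit val encInner: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]]
implicit val encOther: Encoder[MyObj[OtherObj]] = Encoders.kryo[MyObj[OtherObj]]

// both datasets are created from the same generic case class
val d1 = spark.createDataset(Seq(MyObj(1, new InnerObj(0, "a"))))
val d2 = spark.createDataset(Seq(MyObj(2, new OtherObj(3.14))))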
Ingroup answered 7/10, 2020 at 16:18
More precise. Would be a good pattern when you have type classes and more abstractions. – Otorhinolaryngology
