Say you have this (the solution for encoding a custom type is taken from this thread):
// assume we need to handle a custom (non-case-class) type
class MyObj(val i: Int, val j: String)
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
val ds = spark.createDataset(Seq(new MyObj(1, "a"), new MyObj(2, "b"), new MyObj(3, "c")))
When I do a ds.show, I get:
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
I understand that it's because the contents are encoded into internal Spark SQL binary representation. But how can I display the decoded content like this?
+---+---+
| _1| _2|
+---+---+
| 1| a|
| 2| b|
| 3| c|
+---+---+
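One workaround that seems possible (a sketch, assuming import spark.implicits._ is in scope, as in spark-shell, so Spark can derive a tuple encoder): map each object to a tuple of its fields, which picks up Spark's built-in product encoder instead of the Kryo one, and show that:

// map the Kryo-encoded objects to tuples of their fields;
// tuples use Spark's product encoder, so show() prints readable _1/_2 columns
import spark.implicits._
ds.map(o => (o.i, o.j)).show()

But that means repeating the field list at every call site, which leads to the bigger problem below.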
UPDATE1
Displaying the content is not the biggest issue; more importantly, the encoding can cause problems when processing the dataset. Consider this example:
// continue with the code above
val ds2 = spark.createDataset(Seq(new MyObj(2, "a"), new MyObj(6, "b"), new MyObj(5, "c")))
ds.joinWith(ds2, ds("i") === ds2("i"), "inner")
// throws at runtime: org.apache.spark.sql.AnalysisException:
// Cannot resolve column name "i" among (value);
Does this mean a kryo-encoded type cannot conveniently be used in operations like joinWith? How do we process a custom type in a Dataset then? If we cannot process it after it's encoded, what's the point of this kryo encoding solution for custom types?!
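The only workaround I can sketch for the join (under the same assumption that spark.implicits._ is in scope for the tuple encoder) is to convert both sides to tuples first, so the join condition can reference a real column such as _1:

// convert to tuples so the schema has addressable columns (_1, _2),
// then join on the first field
val t1 = ds.map(o => (o.i, o.j))
val t2 = ds2.map(o => (o.i, o.j))
t1.joinWith(t2, t1("_1") === t2("_1"), "inner").show()

Mapping the results back to MyObj afterwards would need the Kryo encoder again.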
(The solution provided by @jacek below is good to know for case class types, but it still cannot decode a custom type.)
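For reference, a sketch of that case class route (MyObjCC here is a hypothetical stand-in, since MyObj itself is not a case class):

// a case class gets a product encoder derived via spark.implicits._,
// so columns keep their field names instead of a single binary value
case class MyObjCC(i: Int, j: String)
import spark.implicits._
val dsCC = spark.createDataset(Seq(MyObjCC(1, "a"), MyObjCC(2, "b"), MyObjCC(3, "c")))
dsCC.show()   // columns are named i and j, and ds("i") resolves in joins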