I am going to ask this question in the context of Spark, because that's what I'm facing, but this might be a plain Java issue.
In our spark job, we have a Resolver
which needs to be used in all of our workers (it's used in a udf). The problem is that it's not serializable and we cannot change it to be so. The solution was to put it as a member of another class which is serializable.
So we ended up with:
public class Analyzer implements Serializable {
transient Resolver resolver;
public Analyzer() {
System.out.println("Initializing a Resolver...");
resolver = new Resolver();
}
public int resolve(String key) {
return resolver.find(key);
}
}
We then broadcast
this class using the Spark API:
val analyzer = sparkContext.broadcast(new Analyzer())
(more information about Spark broadcast can be found here)
We then proceed to use analyzer
in a UDF, as part of our spark code, with something like:
val resolve = udf((key: String) => analyzer.value.resolve(key))
val result = myDataFrame.select("key", resolve("key")).count()
This all works as expected, but leaves we wondering.
Resolver
does not implement Serializable
and is, therefore, marked as transient
- meaning it does not get serialized along with it's owner object Analyzer
.
But as you can see clearly from the code above, the resolve()
method uses resolver
, so it must not be null. And indeed it isn't. The code works.
So if the field is not passed through serialization, how is the resolver
member instantiated?
My initial thought was that maybe the Analyzer
constructor is called on the receiving side (i.e. the spark worker), but then I would expect to see the line "Initializing a Resolver..."
printed several times. But it's only printed once, which is probably an indication to the fact that it's only called once, right before it's passed to the broadcast API. So why isn't resolver
null?
Am I missing something about JVM serialization or Spark serialization?
How does this code even work?
Spark runs on YARN, in cluster
mode.
spark.serializer
is set to org.apache.spark.serializer.KryoSerializer
.
org.apache.spark.serializer.JavaSerializer
it would be null as expected). – Upbowtransient
it fails, and the stack-trace leads to kryo – Arterio