I am trying to understand the following code.
// File: LambdaTest.java
package test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LambdaTest implements Ops {

    public static void main(String[] args) {
        new LambdaTest().job();
    }

    public void job() {
        SparkConf conf = new SparkConf()
                .setAppName(LambdaTest.class.getName())
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Integer> lst = Arrays.asList(1, 2, 3, 4, 5, 6);
        JavaRDD<Integer> rdd = jsc.parallelize(lst);

        Function<Integer, Integer> func1 = (Function<Integer, Integer> & Serializable) x -> x * x;
        Function<Integer, Integer> func2 = x -> x * x;

        System.out.println(func1.getClass()); // test.LambdaTest$$Lambda$8/390374517
        System.out.println(func2.getClass()); // test.LambdaTest$$Lambda$9/208350681

        this.doSomething(rdd, func1); // works
        this.doSomething(rdd, func2); // org.apache.spark.SparkException: Task not serializable
    }
}
// File: Ops.java
package test;

import org.apache.spark.api.java.JavaRDD;
import java.util.function.Function;

public interface Ops {

    default void doSomething(JavaRDD<Integer> rdd, Function<Integer, Integer> func) {
        rdd.map(x -> x + func.apply(x))
           .collect()
           .forEach(System.out::println);
    }
}
The difference is that func1 is cast with a Serializable bound, while func2 is not.
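To rule out Spark itself, here is a minimal Spark-free check (my own code, illustrative class name) that I believe shows the same difference using plain Java serialization:

// File: CastCheck.java (my own Spark-free check, not part of the job)
package test;

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class CastCheck {

    public static void main(String[] args) throws Exception {
        Function<Integer, Integer> f1 = (Function<Integer, Integer> & Serializable) x -> x * x;
        Function<Integer, Integer> f2 = x -> x * x;

        // The intersection cast makes the generated lambda class implement
        // Serializable, so plain Java serialization accepts f1 ...
        new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f1);
        System.out.println("f1 serialized fine");

        // ... while f2 should fail here with java.io.NotSerializableException.
        new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f2);
        System.out.println("f2 serialized fine (not expected)");
    }
}

If that check is right, the difference between the two lambdas exists independently of Spark.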
Looking at the runtime classes of the two functions, both are anonymous classes under the LambdaTest class. Since both are used in an RDD transformation defined in an interface, the two functions and LambdaTest should have to be serializable.
As you can see, LambdaTest does not implement the Serializable interface, so I would expect neither function to work. Surprisingly, func1 works.
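A quick check I added inside job(), right after the getClass() prints (my own addition), seems consistent with that; I would expect it to print true and then false:

System.out.println(func1 instanceof Serializable); // true  - intersection cast, lambda class implements Serializable
System.out.println(func2 instanceof Serializable); // false - plain lambda, no Serializable bound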
The stack trace for func2 is the following:
Serialization stack:
- object not serializable (class: test.LambdaTest$$Lambda$9/208350681, value: test.LambdaTest$$Lambda$9/208350681@61d84e08)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=interface fr.leboncoin.etl.jobs.test.Ops, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic fr/leboncoin/etl/jobs/test/Ops.lambda$doSomething$1024e30a$1:(Ljava/util/function/Function;Ljava/lang/Integer;)Ljava/lang/Integer;, instantiatedMethodType=(Ljava/lang/Integer;)Ljava/lang/Integer;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class fr.leboncoin.etl.jobs.test.Ops$$Lambda$10/1470295349, fr.leboncoin.etl.jobs.test.Ops$$Lambda$10/1470295349@4e1459ea)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 19 more
It seems that if a function is cast with a Serializable bound, the object containing it does not need to be serialized, which confuses me.
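Since the stack trace mentions SerializedLambda and capturedArgs, I also poked at the synthetic writeReplace method that serializable lambdas get (my own reflective probe, fragile and JDK-dependent; the method only exists for lambdas created with the Serializable bound):

// File: LambdaProbe.java (my own debugging helper, not part of the job)
package test;

import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.lang.reflect.Method;
import java.util.function.Function;

public class LambdaProbe {

    public static void main(String[] args) throws Exception {
        Function<Integer, Integer> func1 = (Function<Integer, Integer> & Serializable) x -> x * x;

        // Serializable lambdas get a synthetic private writeReplace() that
        // returns the SerializedLambda shown in the serialization stack above.
        Method writeReplace = func1.getClass().getDeclaredMethod("writeReplace");
        writeReplace.setAccessible(true);
        SerializedLambda sl = (SerializedLambda) writeReplace.invoke(func1);

        System.out.println("capturingClass   = " + sl.getCapturingClass());
        System.out.println("implMethodName   = " + sl.getImplMethodName());
        System.out.println("capturedArgCount = " + sl.getCapturedArgCount());
    }
}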
Any explanation on this is highly appreciated.
------------------------------ Updates ------------------------------
I have tried using an abstract class instead of an interface:
// File: AbstractTest.java
package test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class AbstractTest {

    public static void main(String[] args) {
        new AbstractTest().job();
    }

    public void job() {
        SparkConf conf = new SparkConf()
                .setAppName(AbstractTest.class.getName())
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Integer> lst = Arrays.asList(1, 2, 3, 4, 5, 6);
        JavaRDD<Integer> rdd = jsc.parallelize(lst);

        Ops ops = new Ops() {
            @Override
            public Integer apply(Integer x) {
                return x + 1;
            }
        };
        System.out.println(ops.getClass()); // class fr.leboncoin.etl.jobs.test.AbstractTest$1

        ops.doSomething(rdd);
    }
}
// File: Ops.java
package test;

import org.apache.spark.api.java.JavaRDD;
import java.io.Serializable;

public abstract class Ops implements Serializable {

    public abstract Integer apply(Integer x);

    public void doSomething(JavaRDD<Integer> rdd) {
        rdd.map(x -> x + apply(x))
           .collect()
           .forEach(System.out::println);
    }
}
It does not work either, even though the Ops class is compiled in a file separate from the AbstractTest class. The ops object's class name is class fr.leboncoin.etl.jobs.test.AbstractTest$1. According to the following stack trace, it seems that AbstractTest needs to be serialized in order to serialize AbstractTest$1.
Serialization stack:
- object not serializable (class: test.AbstractTest, value: test.AbstractTest@21ac5eb4)
- field (class: test.AbstractTest$1, name: this$0, type: class test.AbstractTest)
- object (class test.AbstractTest$1, test.AbstractTest$1@36fc05ff)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class fr.leboncoin.etl.jobs.test.Ops, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeSpecial fr/leboncoin/etl/jobs/test/Ops.lambda$doSomething$6d6228b6$1:(Ljava/lang/Integer;)Ljava/lang/Integer;, instantiatedMethodType=(Ljava/lang/Integer;)Ljava/lang/Integer;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class fr.leboncoin.etl.jobs.test.Ops$$Lambda$8/208350681, fr.leboncoin.etl.jobs.test.Ops$$Lambda$8/208350681@4acb2510)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 19 more
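For what it's worth, the this$0 field in the stack trace looks like the usual hidden reference an anonymous inner class keeps to its enclosing instance. Here is a Spark-free sketch of what I think is going on (my own code, illustrative names):

// File: InnerCaptureCheck.java (my own Spark-free sketch, not part of the job)
package test;

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class InnerCaptureCheck {

    interface Op extends Serializable {
        Integer apply(Integer x);
    }

    // At least with the Java 8 compiler used here, this anonymous class keeps a
    // hidden this$0 field pointing at the (non-serializable) enclosing object,
    // even though apply() never uses it; newer javac versions may drop it.
    Op makeAnon() {
        return new Op() {
            @Override
            public Integer apply(Integer x) {
                return x + 1;
            }
        };
    }

    // A lambda that does not touch 'this' captures no enclosing instance.
    Op makeLambda() {
        return x -> x + 1;
    }

    public static void main(String[] args) throws Exception {
        InnerCaptureCheck outer = new InnerCaptureCheck();

        // Should serialize fine: nothing from the enclosing instance is captured.
        new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(outer.makeLambda());
        System.out.println("lambda ok");

        // Should fail with java.io.NotSerializableException: test.InnerCaptureCheck,
        // which looks like the this$0 entry in the stack trace above.
        new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(outer.makeAnon());
        System.out.println("anonymous ok (not expected)");
    }
}

If that reading is right, the abstract-class version fails for the same underlying reason as func2: something non-serializable ends up captured in the closure.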