NotSerializableException when sorting in Spark
I'm trying to write a simple stream processing Spark job which will take a list of messages (JSON-formatted), each belonging to a user, count the messages of each user and print the top ten users.

However, when I define the Comparator<Tuple2<String, Long>> used to sort the reduced counts, the whole thing fails with a java.io.NotSerializableException being thrown.

My Maven dependency for Spark:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.9.3</artifactId>
        <version>0.8.0-incubating</version>
    </dependency>

The Java code I'm using:

public static void main(String[] args) {

    JavaSparkContext sc = new JavaSparkContext("local", "spark");

    JavaRDD<String> lines = sc.textFile("stream.sample.txt").cache();

    JavaPairRDD<String, Long> words = lines
        .map(new Function<String, JsonElement>() {
            // parse line into JSON
            @Override
            public JsonElement call(String t) throws Exception {
                return (new JsonParser()).parse(t);
            }

        }).map(new Function<JsonElement, String>() {
            // read User ID from JSON
            @Override
            public String call(JsonElement json) throws Exception {
                return json.getAsJsonObject().get("userId").toString();
            }

        }).map(new PairFunction<String, String, Long>() {
            // emit a (userId, 1) pair for each line
            @Override
            public Tuple2<String, Long> call(String arg0) throws Exception {
                return new Tuple2<String, Long>(arg0, 1L);
            }

        }).reduceByKey(new Function2<Long, Long, Long>() {
            // count messages for every user
            @Override
            public Long call(Long arg0, Long arg1) throws Exception {
                return arg0 + arg1;
            }

        });

    // sort result in a descending order and take 10 users with highest message count
    // This causes the exception
    List<Tuple2<String, Long>> sorted = words.takeOrdered(10, new Comparator<Tuple2<String, Long>> (){

        @Override
        public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
            return -1 * o1._2().compareTo(o2._2());
        }

    });

    // print result
    for (Tuple2<String, Long> tuple : sorted) {
        System.out.println(tuple._1() + ": " + tuple._2());
    }

}

The resulting stack trace:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: net.imagini.spark.test.App$5
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:556)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:670)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:668)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:668)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:376)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)

I went through the Spark API documentation but couldn't find anything that would point me in the right direction. Am I doing something wrong, or is this a bug in Spark? Any help would be gladly appreciated.

Reproduce answered 17/10, 2013 at 17:13 Comment(1)
UPDATE: Apparently, it all boils down to the Comparator object passed as the second argument to takeOrdered(). Since the Comparator interface does not extend Serializable, making this work requires a serializable comparator:

    public interface SerializableComparator<T> extends Comparator<T>, Serializable { }

Passing an object that implements this interface as the comparator prevents the original exception. Granted, this probably isn't the most elegant solution to the problem, and I would definitely welcome some suggestions :)
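To see why this helps without running Spark at all, here is a minimal, self-contained JDK sketch (class and element types are illustrative, not from the question): a comparator class that implements both Comparator and Serializable survives the same Java serialization round-trip that Spark performs when shipping a task's closures to executors.

```java
import java.io.*;
import java.util.*;

public class ComparatorDemo {

    // An explicitly serializable comparator, per the fix described above.
    static class DescendingComparator implements Comparator<Integer>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Integer a, Integer b) {
            return b.compareTo(a); // reversed arguments give descending order
        }
    }

    public static void main(String[] args) throws Exception {
        Comparator<Integer> cmp = new DescendingComparator();

        // Round-trip through Java serialization, roughly what Spark does
        // with the comparator passed to takeOrdered().
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(cmp);
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            Comparator<Integer> restored = (Comparator<Integer>) in.readObject();

            List<Integer> counts = new ArrayList<>(Arrays.asList(3, 1, 7, 5));
            counts.sort(restored);
            System.out.println(counts); // prints [7, 5, 3, 1]
        }
    }
}
```

A plain anonymous Comparator (as in the question) would fail at the writeObject() call with exactly the NotSerializableException shown in the stack trace.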
As @vanco.anton alluded to, you can do something like the following using Java 8 functional interfaces:

public interface SerializableComparator<T> extends Comparator<T>, Serializable {

  static <T> SerializableComparator<T> serialize(SerializableComparator<T> comparator) {
    return comparator;
  }

}

And then in your code:

import static SerializableComparator.serialize;
...
rdd.top(10, serialize((a, b) -> -a._2.compareTo(b._2)));
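The trick here is that a lambda is serializable whenever its target type extends Serializable. The following standalone sketch (no Spark needed; the demo class name is illustrative) shows the same helper interface producing a lambda that survives a serialization round-trip:

```java
import java.io.*;
import java.util.*;

public class LambdaSerializationDemo {

    // The same helper interface as in the answer above.
    public interface SerializableComparator<T> extends Comparator<T>, Serializable {
        static <T> SerializableComparator<T> serialize(SerializableComparator<T> comparator) {
            return comparator;
        }
    }

    public static void main(String[] args) throws Exception {
        // Because the lambda's target type extends Serializable,
        // the JVM generates a serializable lambda instance.
        SerializableComparator<Integer> cmp =
                SerializableComparator.serialize((a, b) -> b.compareTo(a));

        // Round-trip through Java serialization, as Spark would when
        // shipping the comparator to executors.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(cmp);
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            Comparator<Integer> restored = (Comparator<Integer>) in.readObject();

            List<Integer> counts = new ArrayList<>(Arrays.asList(3, 1, 7));
            counts.sort(restored);
            System.out.println(counts); // prints [7, 3, 1]
        }
    }
}
```

Handing the same comparator lambda directly to top() or takeOrdered() (without the serialize() wrapper) would infer the plain, non-serializable Comparator type and reproduce the original exception.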
Thoroughfare answered 27/7, 2015 at 0:26 Comment(0)
