I'm running a Spark job. The UI shows that all of the jobs completed; however, after a couple of minutes the entire job restarts. This time it again shows that all jobs and tasks completed, but after a couple more minutes it fails. I found this exception in the logs:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
This happens when I'm trying to join two pretty big tables: one with 3B rows and the other with 200M rows. When I run show(100) on the resulting DataFrame, everything gets evaluated and I hit this issue.
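For context, the join is roughly of this shape (the table names, column name, and setup below are simplified placeholders, not my actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("bigJoin"))
val sqlContext = new SQLContext(sc)

val bigDf = sqlContext.table("bigTable")      // ~3B rows
val smallDf = sqlContext.table("smallTable")  // ~200M rows

// join on a shared key, then force evaluation with show()
val joined = bigDf.join(smallDf, bigDf("id") === smallDf("id"))
joined.show(100)  // this is the action that triggers the failure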
I tried playing around with increasing/decreasing the number of partitions, and I changed the garbage collector to G1 with an increased number of concurrent GC threads. I also changed spark.sql.broadcastTimeout to 600 (which only made the timeout message change to 600 seconds).
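For reference, I set that timeout roughly like this (shown via sqlContext.setConf as an example; the exact place I set it in my code may differ):

// bump the broadcast timeout from the default 300s to 600s
sqlContext.setConf("spark.sql.broadcastTimeout", "600")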
I also read that this might be a communication issue; however, other show() calls that run prior to this code segment work without problems, so that's probably not it.
This is the submit command:
/opt/spark/spark-1.4.1-bin-hadoop2.3/bin/spark-submit --master yarn-cluster --class className --executor-memory 12g --executor-cores 2 --driver-memory 32g --driver-cores 8 --num-executors 40 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20" /home/asdf/fileName-assembly-1.0.jar
You can get an idea of the Spark version and the resources used from there.
Where do I go from here? Any help would be appreciated, and I can provide code segments/additional logging if needed.