I'm trying to run Spark on a working Hadoop cluster. When I run my Python job with a small dataset, everything works fine. However, when I use a larger dataset, the task fails and the Hadoop ResourceManager shows the diagnostic:
Shutdown hook called before final status was reported.
The command I use to run the job is:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/dev/null \
  --conf spark.executorEnv.SPARK_HOME=/dev/null \
  project-spark.py
It's just test code that generates some data and runs Spark's KMeans algorithm on the generated data.
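For reference, the script looks roughly like the sketch below. This is a simplified reconstruction, not the exact code; the data size, number of clusters, and the use of NumPy on the driver are assumptions:

```python
# Hypothetical sketch of project-spark.py: generate random points,
# then cluster them with Spark's KMeans.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-test").getOrCreate()

# Data is generated on the driver with NumPy (assumed), then parallelized.
points = np.random.rand(100000, 3)
df = spark.createDataFrame([(Vectors.dense(p),) for p in points], ["features"])

model = KMeans(k=5, seed=42).fit(df)
print(model.clusterCenters())

spark.stop()
```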
Any ideas what I should be doing? Any help is greatly appreciated.
Also, I am using Spark v2.0.0 on a Hadoop v2.6.0 cluster consisting of 4 workers, with Anaconda2 v4.1.1 as the Python distribution.
Update:
As @rakesh.rakshit suggested, I ran the job with --master yarn-client
and monitored the task. I found out that, as @ShuaiYuan suggested, I actually had a memory-intensive part that wasn't done through Spark functions, which was causing the problem.
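For anyone hitting the same thing, the fix was to move the data generation into Spark itself so it runs on the executors rather than on the driver. A minimal sketch of that idea, using the RDD-based MLlib API (the row count, dimensionality, and partition count here are placeholders, not my actual values):

```python
# Generate the test data inside Spark (distributed across the executors)
# instead of building it on the driver, which avoids the driver-side
# memory pressure that was causing the failure in my case.
from pyspark import SparkContext
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-test")

# 10 million 3-dimensional points, generated in parallel across 40 partitions.
data = RandomRDDs.uniformVectorRDD(sc, numRows=10000000, numCols=3, numPartitions=40)

model = KMeans.train(data, k=5, maxIterations=20, seed=42)
print(model.clusterCenters)

sc.stop()
```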
Also, it seems that as of Spark 1.4.0 it is no longer required to set the SPARK_HOME
variable, since that issue was resolved.