I am trying to understand how Spark runs on a YARN cluster/client. I have the following questions in mind.
Is it necessary for Spark to be installed on all the nodes of the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and need to be able to interpret the code (Spark APIs) in the Spark application sent to the cluster by the driver.
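For context, this is roughly how I submit from the client/gateway node; the main class, jar, and HDFS path below are placeholders, and I am assuming spark.yarn.archive is the intended way to point the NodeManagers at the Spark libraries:

    # Submit from the client node in cluster mode: spark-submit uploads app.jar
    # to the cluster, and spark.yarn.archive points YARN at an archive of the
    # Spark jars in HDFS (path below is made up for this example).
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.archive=hdfs:///spark/spark-libs.zip \
      --class com.example.MyApp \
      app.jar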
The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to have Hadoop installed when it is only sending the job to the cluster?
Spark no longer uses --num-executors in YARN mode – Rosinski