I am running a spark cluster over C++ code wrapped in python. I am currently testing different configurations of multi-threading options (at Python level or Spark level).
I am using spark with standalone binaries, over a HDFS 2.5.4 cluster. The cluster is currently made of 10 slaves, with 4 cores each.
From what I can see, by default, Spark launches 4 slaves per node (I have 4 python working on a slave node at a time).
How can I limit this number ? I can see that I have a --total-executor-cores option for "spark-submit", but there is little documentation on how it impacts the distribution of executors over the cluster !
I will run tests to get a clear idea, but if someone knowledgeable has a clue of what this option does, it could help.
Update :
I went through spark documentation again, here is what I understand :
- By default, I have one executor per worker node (here 10 workers node, hence 10 executors)
- However, each worker can run several tasks in parallel. In standalone mode, the default behavior is to use all available cores, which explains why I can observe 4 python.
- To limit the number of cores used per worker, and limit the number of parallel tasks, I have at least 3 options :
- use
--total-executor-cores
whithspark-submit
(least satisfactory, since there is no clue on how the pool of cores is dealt with) - use
SPARK_WORKER_CORES
in the configuration file - use
-c
options with the starting scripts
- use
The following lines of this documentation http://spark.apache.org/docs/latest/spark-standalone.html helped me to figure out what is going on :
SPARK_WORKER_INSTANCES
Number of worker instances to run on each machine (default: 1). You can make this more than 1 if you have have very large machines and would like multiple Spark worker processes. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
What is still unclear to me is why it is better in my case to limit the number of parallel tasks per worker node to 1 and rely on my C++ legacy code multithreading. I will update this post with experiment results, when I will finish my study.