Spark on EMR-5.32.0 not spawning requested executors

I am running into some problems with (Py)Spark on EMR (release 5.32.0). Approximately a year ago I ran the same program on an EMR cluster (I think the release must have been 5.29.0), and at that time I was able to configure my PySpark program properly using spark-submit arguments. Now, however, I am running the same/similar code, but the spark-submit arguments do not seem to have any effect.

My cluster configuration:

  • master node: 8 vCores, 32 GiB memory, 128 GiB EBS-only storage
  • slave nodes: 10 x 16 vCores, 64 GiB memory, 256 GiB EBS-only storage each

I run the program with the following spark-submit arguments:

spark-submit --master yarn \
  --conf "spark.executor.cores=3" \
  --conf "spark.executor.instances=40" \
  --conf "spark.executor.memory=8g" \
  --conf "spark.driver.memory=8g" \
  --conf "spark.driver.maxResultSize=8g" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.default.parallelism=480" \
  update_from_text_context.py

I did not change anything in the default configurations on the cluster.

Below is a screenshot of the Spark UI, which indicates only 10 executors, whereas I expected 40 executors to be available...

[Screenshot of the Spark UI executors page, showing only 10 executors]

I tried different spark-submit arguments to make sure the error was unrelated to Apache Spark itself: setting the number of executor instances simply does not change the number of executors. I have tried a lot of things, and nothing seems to help.

I am a little lost here. Could someone help?

UPDATE: I ran the same code on EMR release label 5.29.0, and there the conf settings in the spark-submit arguments do seem to have an effect:

[Screenshot of the Spark UI on emr-5.29.0, showing the requested executor configuration]

Why is this happening?

Redroot answered 5/1, 2021 at 11:56 Comment(0)

Sorry for the confusion, but this is intentional. On emr-5.32.0, Spark+YARN will coalesce multiple executor requests that land on the same node into a larger executor container. Note how even though you had fewer executors than you expected, each of them had more memory and cores than you had specified. (There's one asterisk here, though, that I'll explain below.)

This feature is intended to provide better performance by default in most cases. If you would really prefer to keep the previous behavior, you may disable this new feature by setting spark.yarn.heterogeneousExecutors.enabled=false, though we (I am on the EMR team) would like to hear from you about why the previous behavior is preferable.
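
For example, you could add that flag to the spark-submit invocation from your question (a trimmed-down sketch; the property can equally be set cluster-wide through the spark-defaults configuration classification when creating the cluster):

spark-submit --master yarn \
  --conf "spark.yarn.heterogeneousExecutors.enabled=false" \
  --conf "spark.executor.cores=3" \
  --conf "spark.executor.instances=40" \
  --conf "spark.executor.memory=8g" \
  update_from_text_context.py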

One thing that doesn't make sense to me, though, is that you should be ending up with the same total number of executor cores that you would have without this feature, but that doesn't seem to have occurred for the example you shared. You asked for 40 executors with 3 cores each (120 cores in total) but then got 10 executors with 15 cores each (150 cores), which is a bit more in total. This may have to do with the way that your requested spark.executor.memory of 8g divides into the memory available on your chosen instance type, which I'm guessing is probably m5.4xlarge. One thing that may help you is to remove all of your overrides for spark.executor.memory/cores/instances and just use the defaults. Our hope is that the defaults will give the best performance in most cases. If not, like I said above, please let us know so that we can improve further!
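
If you want to try that, the submission from your question could be pared down to something like this (a sketch of one way to read "use the defaults": the executor memory/cores/instances overrides are dropped, and the spark.dynamicAllocation.enabled=false override is dropped as well so the EMR default of dynamic allocation applies):

spark-submit --master yarn \
  --conf "spark.driver.memory=8g" \
  --conf "spark.driver.maxResultSize=8g" \
  --conf "spark.default.parallelism=480" \
  update_from_text_context.py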

Fiesta answered 12/2, 2021 at 16:32 Comment(3)
Also, sorry for taking over a month to answer this! I haven't been keeping up with StackOverflow as much as I'd like. – Fiesta
Thanks for the extensive response! From what I have read so far, the optimal number of cores per executor is 3 or 4, so I tried to optimize for this and expected a certain behaviour. I agree with this post forums.aws.amazon.com/thread.jspa?threadID=332806&tstart=0 that it is a bit weird to override the default Spark behaviour without documenting it properly. I would suggest making it something you enable explicitly in the cluster configuration. Also, do you have some explanation of why larger executors would work better in most cases? – Redroot
Hi Jonathan, I'm stuck with this issue as well: I set spark.executor.cores=1, but YARN and Spark always request a single large container with 2 cores. This breaks our application, because the task is not designed to run in a multithreaded fashion. I opened an AWS support ticket but didn't get much help. I will try your suggestion tomorrow and let you know if it solves my problem. Thanks a bunch. – Tonguetied

OK, in case someone is facing the same problem: as a workaround you can simply revert to a previous version of EMR. In my case I reverted to EMR release label 5.29.0, which solved all my problems. Suddenly I was able to configure the Spark job again!
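
For reference, pinning the older release just means choosing emr-5.29.0 when creating the cluster, e.g. with the AWS CLI (a rough sketch; the cluster name, instance type, and instance count are placeholders matching my setup, and the usual role/subnet options are omitted):

aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type m5.4xlarge \
  --instance-count 11 \
  --use-default-roles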

I am still not sure why it doesn't work on EMR release label 5.32.0, so if someone has suggestions, please let me know!

Redroot answered 7/1, 2021 at 9:2 Comment(1)
Please see my answer. Each EMR release contains further performance improvements for Spark, so you may be missing out on a lot of performance gains by sticking with an older release. (By this point, emr-5.29.0 is over a full year old, and we've done a lot since then.) – Fiesta
