Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information".
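For context, the flag is set through an EMR configuration classification at cluster creation time, something like this (a sketch of the JSON passed via the console or `--configurations`):

```json
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```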

I'm running the cluster using m3.2xlarge instances for the worker nodes. I'm using a single m3.xlarge for the YARN master - the smallest m3 instance I can get it to run on, since it doesn't do much.

The situation is this: when I run a Spark job, each executor requests 8 cores. (I only see this number at all after setting "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator", which isn't actually in the documentation, but I digress.) This seems to make sense, because according to these docs an m3.2xlarge has 8 "vCPUs". However, on the instances themselves, /etc/hadoop/conf/yarn-site.xml configures each node with yarn.nodemanager.resource.cpu-vcores set to 16. I'd guess that must be due to hyperthreading or perhaps some other hardware fanciness.
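For anyone else hunting for that undocumented setting, it goes in the capacity-scheduler classification, roughly:

```json
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
```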

So the problem is this: when I use maximizeResourceAllocation, each executor requests only as many cores as the Amazon instance type has "vCPUs", which is only half the number of "VCores" that YARN advertises on the node; as a result, the executor uses only half of the instance's actual compute resources.

Is this a bug in Amazon EMR? Are other people experiencing the same problem? Is there some other magic undocumented configuration that I am missing?

Lovesome answered 30/11, 2015 at 16:51 Comment(0)

Okay, after a lot of experimentation, I was able to track down the problem. I'm going to report my findings here to help people avoid frustration in the future.

  • While there is a discrepancy between the 8 cores asked for and the 16 VCores that YARN knows about, this doesn't seem to make a difference. YARN isn't using cgroups or anything fancy to actually limit how many CPUs the executor can actually use.
  • "Cores" on the executor is actually a bit of a misnomer. It is actually how many concurrent tasks the executor will willingly run at any one time; it essentially boils down to how many threads are doing "work" on each executor.
  • When maximizeResourceAllocation is set, when you run a Spark program, it sets the property spark.default.parallelism to be the number of instance cores (or "vCPUs") for all the non-master instances that were in the cluster at the time of creation. This is probably too small even in normal cases; I've heard that it is recommended to set this at 4x the number of cores you will have to run your jobs. This will help make sure that there are enough tasks available during any given stage to keep the CPUs busy on all executors.
  • When you have data that comes from different runs of different spark programs, your data (in RDD or Parquet format or whatever) is quite likely to be saved with varying number of partitions. When running a Spark program, make sure you repartition data either at load time or before a particularly CPU intensive task. Since you have access to the spark.default.parallelism setting at runtime, this can be a convenient number to repartition to.

TL;DR

  1. maximizeResourceAllocation will do almost everything for you correctly except...
  2. You probably want to explicitly set spark.default.parallelism to 4x the number of instance cores you want the job to run on, on a per-"step" (EMR speak)/"application" (YARN speak) basis, i.e. set it every time, and...
  3. Make sure within your program that your data is appropriately partitioned (i.e. you want many partitions) to allow Spark to parallelize it properly
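Point 2 can be sketched as a tiny helper (hypothetical function; the 4x factor is the rule of thumb above, not an official AWS recommendation):

```python
def recommended_parallelism(core_nodes: int, vcpus_per_node: int, factor: int = 4) -> int:
    """Rule-of-thumb spark.default.parallelism:
    total cores across core nodes, times an over-partitioning factor."""
    return core_nodes * vcpus_per_node * factor

# e.g. 4 m3.2xlarge core nodes with 8 vCPUs each
print(recommended_parallelism(4, 8))  # 128
```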
Lovesome answered 2/12, 2015 at 11:5 Comment(8)
I'd add that using Spark's dynamic resource allocation may be of use here too (docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/…). Plus, a user can inflate the vcores for an instance type to achieve a good per-task, per-executor balance with real CPU usage.Kial
In EMR 5.x, maximizeResourceAllocation sets spark.default.parallelism to twice the number of total available CPU cores emr-spark-maximizeresourceallocationVic
@Lovesome Can you give an example of "4x number of instance cores you want the job to run on on a per "step" (in EMR speak)/"application" (in YARN speak) basis"? I am not understanding what this means exactly.Mccarthy
@Mccarthy It's a long time since I've touched this, so things are out of date, but roughly, continuing from the (rather old) info in the original post - say I wanted a job to use all of the cores on 4 m3.2xlarge's - which each have 8 "vCPUs" - that's 32 cores/vCPUs - I'd want to set spark.default.parallelism to 128 (32 x 4) on each job. I'd also make sure my program repartitioned the data at appropriate points, so that there were at least 128 partitions to work on. But this may no longer be necessary, this was on EMR 4.2.0, it may be okay in 5.x+, see @donghyun208's comment above.Lovesome
@Lovesome Thanks! How would you suggest setting spark.sql.shuffle.partitions? Would you suggest setting spark.sql.shuffle.partitions to the same value as spark.default.parallelism?Mccarthy
@Mccarthy I don't know, sorry - I wasn't using Spark SQL at the time, and I haven't touched it in about 3-4 years... try googling around for other recommendations, perhaps?Lovesome
@Lovesome - so are you using both "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" AND "maximizeResourceAllocation": "true" in your config? thanks!Spray
@RussellBurdt I haven't touched this project in many years (several jobs ago!), so I don't know what the specific settings are for recent versions of EMR. I would hope it's smarter now than it was then!Lovesome

With this setting you should get 1 executor on each instance (except the master), each with 8 cores and about 30GB of RAM.
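Concretely, on m3.2xlarge core nodes you'd expect the flag to produce spark-defaults along these lines (illustrative values only; check /etc/spark/conf/spark-defaults.conf on the master node for the exact numbers EMR computed):

```
spark.executor.cores      8
spark.executor.memory     ~20g     # node memory minus YARN and executor overhead
spark.executor.instances  <one per core node at cluster-creation time>
```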

Is the YARN ResourceManager UI at http://&lt;master-public-dns&gt;:8088/ not showing that allocation?

I'm not sure that setting really adds a lot of value compared to the other one mentioned on that page, "Enabling Dynamic Allocation of Executors". That'll let Spark manage its own number of executors for a job, and if you launch a task with 2 CPU cores and 3G of RAM per executor you'll get a pretty good ratio of CPU to memory for EMR's instance sizes.
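If you go the dynamic-allocation route, the relevant spark-defaults knobs look roughly like this (a sketch; the external shuffle service must be enabled for dynamic allocation to work):

```
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
spark.executor.cores             2
spark.executor.memory            3g
```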

Hebdomad answered 30/11, 2015 at 17:12 Comment(2)
It does show that; but the problem is those "8" cores are actually only 8 of 16 "VCores" allocated by YARN; half the actual CPU resources on the machine are being left idle. Since I am trying to run CPU intensive jobs, this is a waste of CPU (and money, obviously!)Lovesome
The cores are just an abstraction on the instance type itself. There is no actual binding to the cores, so the executors will use however much CPU the application requests. The only binding comes in when using DominantResourceCalculator for the scheduler. One item to note: for this instance type, EMR's default config doubles the vcore value reported to YARN to help improve CPU utilization with MapReduce, while maximizeResourceAllocation looks at the instance type's core definition.Kial

In EMR version 3.x, maximizeResourceAllocation was implemented with a reference table: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/vcorereference.tsv

It is used by a shell script, maximize-spark-default-config, in the same repo; you can take a look at how they implemented it.

Maybe in the new EMR version 4 this reference table is somehow wrong... I believe you can find these AWS scripts on the EC2 instances of your EMR cluster, somewhere under /usr/lib/spark or /opt/aws or the like.

Anyway, at the least, you can write your own bootstrap action scripts for this in EMR 4, with a corrected reference table, similar to the implementation in EMR 3.x.

Moreover, since we are going to use STUPS infrastructure, it's worth taking a look at the STUPS appliance for Spark: https://github.com/zalando/spark-appliance

There you can explicitly specify the number of cores by setting the senza parameter DefaultCores when you deploy your Spark cluster.

Some highlights of this appliance compared to EMR are:

It can be used even with t2 instance types, is auto-scalable based on roles like other STUPS appliances, etc.

And you can easily deploy your cluster in HA mode with ZooKeeper, so there is no SPOF on the master node. HA mode is currently still not possible in EMR, and I believe EMR is mainly designed for "large clusters temporarily for ad-hoc analysis jobs", not for a "dedicated cluster that is permanently on", so HA mode will not be possible with EMR in the near future.

Thriller answered 2/12, 2015 at 13:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.