Running steps of EMR in parallel

4

7

I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs I trigger execute as steps, one after another (in a queue).

Is there any way to make them run in parallel? If not, is there an alternative?

Tabethatabib answered 30/3, 2017 at 14:54 Comment(2)
Are you using the EMR Step API to submit Spark jobs, and concerned that the steps are running in sequence? Or is the concern that the YARN jobs submitted by Spark are running in a queue? - Arduous
AWS has just released support for running steps in parallel. - Risteau
4

By default, Elastic MapReduce comes with a very "step"-oriented YARN setup: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes cluster usage for that single job, granting it all available resources until it finishes.

Running multiple concurrent jobs in an EMR cluster (or in any other YARN-based Hadoop cluster, in fact) requires a YARN setup with multiple queues so that resources are granted properly to each job. YARN's documentation covers all of the CapacityScheduler features quite well, and the setup is simpler than it sounds.
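As a rough sketch of what such a multi-queue setup could look like on EMR, the snippet below builds a configuration object for the `capacity-scheduler` classification using the standard `yarn.scheduler.capacity.root.*` property names. The queue names and percentages are assumptions chosen for illustration, not values from this thread; the resulting list would be supplied as the cluster's configurations at launch.

```python
import json


def capacity_scheduler_config(queues):
    """Build an EMR 'capacity-scheduler' configuration object.

    `queues` maps a queue name to (capacity %, maximum-capacity %).
    The guaranteed capacities of sibling queues must sum to 100.
    """
    total = sum(cap for cap, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, (cap, max_cap) in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(cap)
        props[f"{prefix}.maximum-capacity"] = str(max_cap)
    return {"Classification": "capacity-scheduler", "Properties": props}


# Hypothetical split: keep 'default' and add an 'adhoc' queue.
config = capacity_scheduler_config({"default": (60, 80), "adhoc": (40, 60)})
print(json.dumps([config], indent=2))
```

Setting `maximum-capacity` below 100 is what stops one queue from borrowing the whole cluster while the other queue is idle.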

YARN's FairScheduler is also quite popular, but it uses a different approach and may be a bit more difficult to configure, depending on your needs. In the simplest scenario, where you have a single fair queue, YARN will try to grant containers to waiting jobs as soon as running jobs free them, ensuring that every job submitted to the cluster gets at least a fraction of the compute resources as soon as they become available.

Pandich answered 10/4, 2017 at 22:24 Comment(1)
Is there any official AWS documentation for this? - Swafford
2

If you are concerned about YARN jobs (submitted by Spark) running in a queue:

There are multiple ways to run jobs in parallel.

By default, EMR uses the YARN CapacityScheduler with the DefaultResourceCalculator and has a single DEFAULT queue to which all YARN jobs are submitted. Since there is only one queue, the number of YARN jobs that you can run (not submit) in parallel really depends on the number of ApplicationMasters (AMs), mappers and reducers that your EMR cluster can run in parallel.

For example: you have a cluster that can run at most 10 mappers in parallel. (See "AWS EMR Parallel Mappers?")

Suppose you submit two map-only jobs, one after another, each requiring 10 mappers. The first job takes up all the mapper container capacity and runs, while the second waits in the queue for containers to free up. The behavior is similar for AMs and reducers.
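The two-job example above can be worked through with a toy model. This is only an illustration of the arithmetic, not YARN's actual allocation algorithm; `grant_containers` and its `max_share` parameter are hypothetical names introduced here.

```python
def grant_containers(total, requests, max_share=1.0):
    """Grant containers to jobs in submission order.

    total:     containers the cluster can run at once
    requests:  containers each job asks for, in submission order
    max_share: largest fraction of the cluster a single job may hold
    """
    free = total
    limit = int(total * max_share)
    granted = []
    for wanted in requests:
        # A job gets what it asked for, bounded by its cap and by what is left.
        given = min(wanted, limit, free)
        granted.append(given)
        free -= given
    return granted


# Single queue, no cap: the first job takes everything, the second waits.
print(grant_containers(10, [10, 10]))       # [10, 0]

# Each job capped at 50% of the cluster: both run, each more slowly.
print(grant_containers(10, [10, 10], 0.5))  # [5, 5]
```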

Now, to make them run in parallel in spite of that limit on the number of containers the cluster supports:

  1. Keeping the CapacityScheduler, you can create multiple queues, each configured with a percentage of the cluster capacity and a maximum capacity. That way a job in the first queue cannot use up all the containers even if it asks for them, and you can submit your second job to the second queue, which has its own pre-determined capacity.

  2. Alternatively, you can switch to the FairScheduler by configuring yarn-site.xml. The FairScheduler lets you configure queues and share resources fairly across them. You can also use the FairScheduler's preemption option.
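A minimal sketch of the second option, plus how a job might be pointed at a specific queue under either scheduler: the `yarn-site` properties below are the standard YARN names for selecting the FairScheduler and enabling preemption, and `--queue` is the spark-submit option for choosing a YARN queue. The helper `spark_submit_args`, the queue name `adhoc` and the script location are assumptions made for illustration.

```python
FAIR_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)

# EMR configuration object that overrides yarn-site.xml at cluster launch.
yarn_site_config = {
    "Classification": "yarn-site",
    "Properties": {
        "yarn.resourcemanager.scheduler.class": FAIR_SCHEDULER,
        "yarn.scheduler.fair.preemption": "true",
    },
}


def spark_submit_args(queue, script_uri):
    """Return a spark-submit command line that targets one YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,       # YARN queue the application is submitted to
        script_uri,
    ]


# Hypothetical queue name and script location.
print(" ".join(spark_submit_args("adhoc", "s3://my-bucket/job.py")))
```

The same argument list could be run from a shell on the master node or passed as the arguments of a step.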

Note that which option to choose really depends on your use case and business needs. It is important to learn about all the options and their possible impact.

https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html

Arduous answered 17/4, 2017 at 5:2 Comment(1)
Can the solution that keeps the capacity scheduler be implemented in AWS EMR? We are looking to run a couple of Spark jobs on a high-capacity cluster so that they run in parallel and each consumes only its assigned compute resources. - Khoisan
2

Amazon EMR now supports the ability to run multiple steps in parallel. The number of steps allowed to run at once is configurable and can be set when a cluster is launched and at any time after the cluster has started.

Please see this announcement for more details: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
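For illustration, here is one way this might look with boto3, whose EMR client accepts a `StepConcurrencyLevel` parameter on both `run_job_flow` (at launch) and `modify_cluster` (on a running cluster). The helper names, the default region and the cluster ID in the usage comment are assumptions, and the second function needs AWS credentials to actually run.

```python
def launch_kwargs(base_kwargs, concurrency):
    """Return run_job_flow arguments with step concurrency added."""
    if concurrency < 1:
        raise ValueError("StepConcurrencyLevel must be at least 1")
    return {**base_kwargs, "StepConcurrencyLevel": concurrency}


def set_step_concurrency(cluster_id, concurrency, region="us-east-1"):
    """Change the step concurrency of a cluster that is already running."""
    import boto3  # imported here so the helper above works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.modify_cluster(
        ClusterId=cluster_id,
        StepConcurrencyLevel=concurrency,
    )


# At launch: merge into whatever arguments you already pass to run_job_flow.
print(launch_kwargs({"Name": "parallel-steps-demo"}, 3))

# On a running cluster (hypothetical cluster ID):
# set_step_concurrency("j-XXXXXXXXXXXXX", 3)
```

Steps that run concurrently still share the cluster's YARN resources, so the queue configuration discussed in the other answers remains relevant.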

Rafaellle answered 9/12, 2019 at 17:55 Comment(0)
