How to find the right proportion between Hadoop instance types
I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure it out.

  • How do I know if I need more than one core instance? What "symptoms" would I see in the metrics on EMR's console that hint I need more than one core? So far, when I tried the same job with 1 core + 7 task instances, it ran pretty much the same as on 8 core instances, but that doesn't make much sense to me. Or is it possible that my job is so CPU-bound that the I/O is negligible? (I have a map-only job that parses Apache log files into CSV files.)

  • Is it possible to have more than one master instance? If yes, when is it needed? I wonder because my master node is pretty much just waiting for the other nodes to do the work (0% CPU) 95% of the time.

  • Can the master and the core node be identical? I can have a master-only cluster, where the one and only node does everything. It would seem logical to be able to have a cluster with one node that is both the master and the core, with the rest being task nodes, but it seems impossible to set it up that way in EMR. Why is that?

Uella answered 29/4, 2014 at 9:29 Comment(0)

The master instance acts as a manager and coordinates everything that goes on in the whole cluster. As such, it has to exist in every job flow you run, but one instance is all you need. Unless you are deploying a single-node cluster (in which case the master instance is the only node running), it does not do any heavy lifting as far as the actual MapReduce work is concerned, so it does not have to be a powerful machine.

The number of core instances you need really depends on the job and how fast you want it processed, so there is no single correct answer. The good news is that you can resize the core/task instance groups, so if you think your job is running slowly, you can add more instances to a running cluster.

One important difference between the core and task instance groups is that core instances store actual data on HDFS, whereas task instances do not. As a result, you can only increase the core instance group (removing running instances would lose the data stored on them). The task instance group, on the other hand, can be both increased and decreased by adding or removing task instances.

So these two instance types can be used to adjust the processing power of your job. Typically, you use on-demand instances for core instances, because they must be running all the time and cannot be lost, and spot instances for task instances, because losing a task instance does not kill the entire job (e.g., the tasks it had not finished will be rerun on the remaining instances). This is one way to run a large cluster cost-effectively using spot instances.
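As a rough illustration of the cost argument, here is a back-of-envelope calculation. All the hourly prices below are hypothetical; real on-demand and spot prices vary by region, instance type, and the current spot market:

```python
# Hypothetical hourly prices for a single instance type; actual EMR/EC2
# prices vary by region, instance type, and the spot market.
ON_DEMAND = 0.40   # $/hour, assumed on-demand price
SPOT = 0.12        # $/hour, assumed average spot price

def cluster_cost_per_hour(core_nodes, task_nodes):
    """Core nodes run on-demand (they hold HDFS data and must not be
    reclaimed); task nodes run on spot (losing one only reruns tasks)."""
    return core_nodes * ON_DEMAND + task_nodes * SPOT

all_on_demand = cluster_cost_per_hour(8, 0)   # 8 on-demand core nodes
mixed = cluster_cost_per_hour(2, 6)           # 2 core + 6 spot task nodes

print(f"8 on-demand core nodes:  ${all_on_demand:.2f}/h")
print(f"2 core + 6 spot task:    ${mixed:.2f}/h")
```

Under these assumed prices, the mixed cluster provides the same node count at less than half the hourly cost, at the risk of temporarily losing some task capacity if the spot instances are reclaimed.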

The general description of each instance type is available here:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/InstanceGroups.html

Also, this video may be useful for using EMR effectively:

https://www.youtube.com/watch?v=a5D_bs7E3uc

Ankylosis answered 30/4, 2014 at 6:15 Comment(6)
All this I know. But how will I know whether 1 core is enough and all the rest (be it 8 or 64 or 200) can be task instances, or whether there is a proportion (obviously dependent on my job) beyond which adding too many task instances slows the job down, because there are not "enough" core instances and the I/O of the core instance (which all the task instances use, if I understand correctly) becomes the bottleneck? – Uella
Oh, I see. That's an interesting question, but I'm afraid I've never experimented with that in mind. I don't think I ever saw my jobs run less efficiently after adding more task nodes, but that might just be my jobs. You could figure it out yourself by monitoring your job on the EMR console and resizing the core/task groups on a long-running job. – Ankylosis
Yes, but my question is exactly that: which metrics on the EMR console are relevant for this? So far, the only thing I did was run the same job with different configurations and measure the time it took, but that's not very accurate. – Uella
Did you get a chance to find an answer to this question? – Heteropolar
It's probably one of those generic questions for which some rule of thumb may exist, but the best solution really depends on the situation, so a simple guideline would not be satisfying (you can use different instance types for the master, core, and task groups, etc., which is already quite complicated). I attended the very AWS re:Invent session the video above comes from, wanting exactly that kind of information, but I didn't get beyond "all that, I already know." I'm curious myself whether useful good practices are available. – Ankylosis
There is more information about this in safaribooksonline.com/library/view/…. A key point is that HDFS data to and from the task nodes must go through the core nodes. I would guess that the number of task nodes can thus be increased until they stop running at full capacity: I would take this as an indication that data is not flowing fast enough through the core nodes (if you can monitor the data transfer rate on the core nodes, this would correspond to a saturating rate). – Stackhouse
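The trade-off the comments circle around can be sketched with a toy model: assume CPU work is shared by all nodes (core and task), while HDFS I/O is served only by core nodes. All the numbers below are invented purely for illustration, and the per-node throughput figure is an assumption, not a measured value:

```python
# Toy model of the core/task sizing question discussed above.
# All numbers are made up; 'bw_gb_per_h' is an assumed aggregate
# HDFS throughput per core node.
def job_time(core, task, cpu_hours=64.0, io_gb=2000.0, bw_gb_per_h=250.0):
    """Lower bound on job duration: the job is limited either by total
    CPU (core + task nodes all compute) or by HDFS I/O (served by core
    nodes only), whichever is slower."""
    compute_limit = cpu_hours / (core + task)   # perfectly parallel CPU work
    io_limit = io_gb / (core * bw_gb_per_h)     # only core nodes serve HDFS
    return max(compute_limit, io_limit)

# With a single core node, adding task nodes helps only until the
# core node's I/O saturates, then the curve flattens:
for task in (0, 7, 15, 31):
    print(f"1 core + {task:2d} task -> {job_time(1, task):5.2f} h")
```

In this model, 1 core + 7 task nodes and 8 core nodes finish in the same time (consistent with the observation in the question), but pushing the task group further past the I/O limit buys nothing: that plateau is the "symptom" to look for when deciding whether to grow the core group instead.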
