I have been learning about Amazon EMR lately, and as far as I understand, an EMR cluster lets us choose three types of nodes:
- Master, which runs the primary Hadoop daemons such as the NameNode and the JobTracker (the ResourceManager on YARN).
- Core, which runs the DataNode and TaskTracker daemons.
- Task, which runs only the TaskTracker daemon.
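
To make the layout concrete, here is a minimal sketch of how I picture the three groups being declared, written in the shape of the `InstanceGroups` list that boto3's `run_job_flow` takes under `Instances`. The group names, instance types, and counts are placeholders I made up, and the `nodes_by_capability` helper is hypothetical; it just encodes my understanding that only core nodes store HDFS blocks while both core and task nodes run tasks.

```python
# Sketch only: names, instance types, and counts are made-up placeholders.
instance_groups = [
    {"Name": "master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 4},
]


def nodes_by_capability(groups):
    """Count nodes that store HDFS blocks versus nodes that run tasks.

    Hypothetical helper reflecting the list above: core nodes do both,
    task nodes only run tasks, and the master does neither.
    """
    storage = sum(g["InstanceCount"] for g in groups if g["InstanceRole"] == "CORE")
    compute = sum(
        g["InstanceCount"] for g in groups if g["InstanceRole"] in ("CORE", "TASK")
    )
    return {"storage": storage, "compute": compute}


print(nodes_by_capability(instance_groups))
```

With these placeholder counts, four of the six nodes that run tasks hold no HDFS data at all, which is exactly the situation I am asking about.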
My question is: why does EMR provide task nodes at all, whereas Hadoop recommends running the DataNode and TaskTracker daemons on the same node? What is Amazon's logic behind this?

You can keep the data in S3, copy it into HDFS on the core nodes, and process it there, rather than sending data from HDFS to the task nodes, which would increase I/O overhead. As far as I know, Hadoop tries to run each task on a TaskTracker whose node also holds the data blocks for that task, so why have TaskTrackers on separate nodes?