As Spark runs in-memory, what does resource allocation mean in Spark when running on YARN, and how does it contrast with Hadoop's container allocation? I am curious because Hadoop's data and computations live on disk, whereas Spark is in-memory.
Hadoop is a framework for processing large data sets. It has two layers: a distributed file system layer called HDFS, and a distributed processing layer. In Hadoop 2.x, the processing layer is architected generically so that it can also serve non-MapReduce applications. Any processing needs system resources such as memory, network, disk and CPU. The term container was introduced in Hadoop 2.x; in Hadoop 1.x, the equivalent term was slot. A container is an allocation, or share, of memory and CPU. YARN is a general resource management framework that enables efficient utilization of the resources in the cluster nodes through proper allocation and sharing.
In-memory processing means the data is loaded completely into memory and processed without writing intermediate data to disk. This is faster because the computation happens in memory with little disk I/O, but it needs more memory because the entire data set is held in memory.
Batch processing means the data is taken and processed in batches; intermediate results are stored on disk and then supplied to the next stage. This also needs memory and CPU, but less than a fully in-memory processing system.
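To make the contrast concrete, here is a minimal Spark sketch (the input path and class name are made up) that keeps one intermediate result purely in executor memory and spills another to disk between stages:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsDisk {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-vs-disk").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; replace with a real HDFS location.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // In-memory style: the intermediate RDD is kept entirely in executor memory,
    // so later actions reuse it without re-reading or touching the disk.
    val inMemory = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)
    println(inMemory.count())

    // Batch/disk style: the intermediate RDD is materialized on local disk,
    // trading speed for a smaller memory footprint.
    val onDisk = lines.map(_.toLowerCase).persist(StorageLevel.DISK_ONLY)
    println(onDisk.count())

    spark.stop()
  }
}
```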
YARN's ResourceManager acts as the central resource allocator for applications such as MapReduce, Impala (with Llama), Spark (in YARN mode), etc. When we submit a job, it requests the ResourceManager for the resources required for execution, and the ResourceManager allocates them based on availability. The resources are granted in the form of containers; a container is just an allocation of memory and CPU. One job may need multiple containers, and containers are allocated across the cluster depending on availability. The tasks are executed inside the containers.
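As a sketch of what that looks like from the Spark side, the settings below are what the Spark application master translates into container requests to the ResourceManager when the application runs on YARN. The numbers and the application name are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// When Spark runs on YARN, these settings become the container requests
// the application master sends to the ResourceManager.
val spark = SparkSession.builder()
  .appName("yarn-resource-demo")            // hypothetical application name
  .master("yarn")                            // ask YARN for the resources
  .config("spark.executor.instances", "4")   // request 4 executor containers
  .config("spark.executor.memory", "2g")     // memory per executor container
  .config("spark.executor.cores", "2")       // vcores per executor container
  .getOrCreate()
```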
For example, when we submit a MapReduce job, an MR application master is launched and it negotiates with the ResourceManager for additional resources. Map and reduce tasks are then spawned in the allocated containers.
Similarly, when we submit a Spark job in YARN mode, a Spark application master is launched and it negotiates with the ResourceManager for additional resources. Spark executors are then launched inside the allocated containers, and the tasks that operate on the RDD partitions run within those executors.
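As a rough illustration (the input path is hypothetical), the tasks produced by the RDD operations below are what actually execute inside the executor containers that YARN granted:

```scala
import org.apache.spark.sql.SparkSession

object YarnRddDemo {
  def main(args: Array[String]): Unit = {
    // --master yarn is typically supplied via spark-submit.
    val spark = SparkSession.builder().appName("yarn-rdd-demo").getOrCreate()
    val sc = spark.sparkContext

    // Each partition of the RDD becomes a task, and each task is scheduled
    // onto an executor running inside a YARN container.
    val wordCounts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // the shuffle launches a second wave of tasks in the same containers

    wordCounts.take(10).foreach(println)
    spark.stop()
  }
}
```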