What is the principle of "code moving to data" rather than data to code?
In a recent discussion about distributed processing and streaming I came across the concept of 'code moving to data'. Can someone please help explain it? The reference for this phrase is MapReduceWay.

It is stated in a question in terms of Hadoop, but I still could not find an explanation of the principle in a tech-agnostic way.

Vestal answered 15/11, 2016 at 4:21
The basic idea is simple: if the code and the data are on different machines, one of them must be moved to the other machine before the code can be executed on the data. If the code is smaller than the data, it is better to send the code to the machine holding the data than the other way around, assuming all the machines are equally fast and code-compatible. (Arguably, you could send the source and JIT-compile it as needed.)

In the world of Big Data, the code is almost always smaller than the data.

On many supercomputers, the data is partitioned across many nodes, and all the code for the entire application is replicated on all nodes, precisely because the entire application is small compared to even the locally stored data. Then any node can run the part of the program that applies to the data it holds. No need to send the code on demand.
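To make the size asymmetry concrete, here is a minimal, self-contained Python sketch. The Node class and word_count function are made up for illustration and are not part of any framework; it serializes a small function, "ships" it to an object holding a large partition, and compares the payload sizes of moving the code versus moving the data.

    import pickle

    def word_count(lines):
        # The "code": a small function we want to run over a large dataset.
        return sum(len(line.split()) for line in lines)

    class Node:
        """A toy worker node that stores a data partition locally."""
        def __init__(self, partition):
            self.partition = partition  # e.g. millions of log lines

        def run(self, serialized_fn):
            # "Code moves to data": deserialize the function, run it locally,
            # and return only the (small) result over the network.
            fn = pickle.loads(serialized_fn)
            return fn(self.partition)

    node = Node(partition=[f"user {i} clicked item {i % 100}" for i in range(1_000_000)])

    # Option A: move the code (a few hundred bytes) to the node holding the data.
    code_bytes = pickle.dumps(word_count)
    result = node.run(code_bytes)

    # Option B: move the data (many megabytes) to the machine holding the code.
    data_bytes = pickle.dumps(node.partition)

    print(f"code payload: {len(code_bytes)} bytes")
    print(f"data payload: {len(data_bytes):,} bytes")
    print(f"result: {result}")

In this toy setup the code payload stays at a few hundred bytes no matter how large the partition grows, while the data payload grows with the data, which is exactly the asymmetry the principle exploits.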

Cuttler answered 15/11, 2016 at 4:25
I also just came across the sentence “Moving Computation is Cheaper than Moving Data” (from the Apache Hadoop documentation) and after some reading I think this refers to the principle of data locality.

Data locality is a task-scheduling strategy aimed at optimizing performance. It is based on the observation that moving data across a network is costly, so whenever a compute/data node frees up and the scheduler has to choose which task to run next, preference is given to the task that will operate on data stored on that free node or close to it.

This (from Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Zaharia et al., 2010) explains it clearly:

Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack).

Note that the fact that Hadoop replicates data across nodes also improves the fairness of task scheduling: the higher the replication factor, the higher the probability that a task has its data on the next free node and hence gets picked to run next.
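For illustration, here is a toy Python sketch of that node/rack/remote preference order. The task records and the pick_task helper are hypothetical, not Hadoop's actual scheduler API; they only show the idea of preferring the task whose input replicas sit closest to the freed slot.

    # Each pending task lists the nodes holding replicas of its input split,
    # plus the racks those replicas live in.
    tasks = [
        {"id": "map-1", "replica_nodes": {"node-3", "node-7"}, "replica_racks": {"rack-B"}},
        {"id": "map-2", "replica_nodes": {"node-1", "node-4"}, "replica_racks": {"rack-A"}},
        {"id": "map-3", "replica_nodes": {"node-9"},           "replica_racks": {"rack-C"}},
    ]

    def pick_task(free_node, free_rack, pending):
        """Prefer a node-local task, then a rack-local one, then any remote task."""
        for level in ("node", "rack", "any"):
            for task in pending:  # pending is assumed to be in priority order
                if level == "node" and free_node in task["replica_nodes"]:
                    return task, "node-local"
                if level == "rack" and free_rack in task["replica_racks"]:
                    return task, "rack-local"
                if level == "any":
                    return task, "remote"
        return None, None

    # node-1 in rack-A frees a map slot: map-2 is picked because its data is local.
    task, locality = pick_task("node-1", "rack-A", tasks)
    print(task["id"], locality)  # -> map-2 node-local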

Cyndi answered 27/6, 2018 at 19:31