Understanding a mapreduce algorithm for overlap calculation
Asked Answered
T

1

6

I want help understanding the algorithm. I ve pasted the algorithm explanation first and then my doubts.

Algorithm:( For calculating the overlap between record pairs)

Given a user defined parameter K, the file DR( *Format: record_id, data*) is split into K nearly equi-sized chunks, such that the data of a document, Di falls into the i/K th chunk.

We overrode Hadoop’s partitioning function which maps a key emitted by the mapper to a reducer instance. Every key (i,j) is mapped to a reducer in the j/Kth group.

The special key i,* and its associated value, i.e, the document's data are replicated at most K times, so that the full content of the document can be delivered at every reducer. Each reducer in a group thus needs to recover and load in memory only one chunk of DR file, whose size can be set arbitrarily small by varying K. Thus overlap can be calculated. This is achieved at the cost of replicating the documents delivered through the MapReduce framework.

Doubts:

I have made some assumptions:

Statement: Every key (i,j) is mapped to a reducer in the j/Kth group. Assumption: K reduce nodes are present, and the the key is mapped to j/Kth reduce node.

Doubt: Are some reduce nodes grouped together? say 0,1,2 nodes are grouped as Group-0?

Statement: the document's data are replicated at most K times, so that the full content of the document can be delivered at every reducer.

So that means K equals no. of reducer nodes? If not then, we are wasting compute nodes right, without using them right?

Main Doubt: Is K equal to the number of Reducer Nodes??

Hoping for responses!

Thanks!

Tucson answered 10/3, 2013 at 6:5 Comment(6)
I don't think you're giving us enough information to understand this algorithm...Scan
Basically, there are two types of Mapper outputs: 1. <i,i><value>,Tucson
Basically, there are two types of mapper output:a. key: <i,j>, val: <intersection> b. key: <i,*>, val: <data>. The first type of outputs can reach any reducer based on the second part of the second part of the key, i.e. j. The goal is to group together at any reduce instance records from any mapper output that has same "i" in first part of the key. <i,*> key must be replicated k times to reach every reduce instance, so that every intersection record with i as the first part of the key gets the i data. Does this mean k must be replicated as many times as the no. of reduce nodes???Tucson
This seems like a lot of replication...Tucson
I'm still not clear on: (1) what this map/reduce program is trying to do; (2) what overriding "Hadoop’s partitioning function" means; (3) why you need two different keys/values and how do they get produced in the map phase (the mapper allows only one type for key and one type for value); (4) why this can't be designed using two or more map/reduce algorithm so that no "hack" (which is always dangerous and not portable) need to be made and the MR paradigm can be respected.Scan
Not cleat either on what you are trying to do. What do you mean by "overlap"? Overlap between what and what? Please provide some minimal example of input (or at least of the mapper output) and the expected output out of the reduce stage.Volney
I
0

Test the same program breaking after nodes become unnecessary against your current program. I find that it’s usually better to break an operation once it’s finished.

However, if the operation only just knows to process later on during operation then it might be necessary to allow the code to continue through to the end.

Ingeingeberg answered 18/4, 2013 at 11:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.