I want help understanding this algorithm. I've pasted the algorithm explanation first, followed by my doubts.
Algorithm (for calculating the overlap between record pairs):
Given a user-defined parameter K, the file DR (*format: record_id, data*) is split into K nearly equi-sized chunks, such that the data of a document Di falls into the i/K-th chunk.
We overrode Hadoop's partitioning function, which maps a key emitted by the mapper to a reducer instance. Every key (i,j) is mapped to a reducer in the j/K-th group.
The special key (i,*) and its associated value, i.e., the document's data, are replicated at most K times, so that the full content of the document can be delivered to every reducer. Each reducer in a group thus needs to recover and load into memory only one chunk of the DR file, whose size can be made arbitrarily small by varying K. The overlap can then be calculated. This is achieved at the cost of replicating the documents delivered through the MapReduce framework.
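To check whether I am reading the partitioning step correctly, this is roughly how I picture the overridden partitioner. Everything concrete below is my own assumption, not from the explanation above: the key encoded as the text "i,j" (or "i,*,g" for the replicated special key, with g the target group), contiguous chunks of about |DR|/K records, each group owning numPartitions/K consecutive reducer slots, and the reducer within a group chosen by i modulo the group size.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// My mental model of the overridden partitioner, not the actual code.
public class ChunkGroupPartitioner extends Partitioner<Text, Text> {

    private static final int K = 4;            // user-defined number of chunks / reducer groups
    private static final int CHUNK_SIZE = 25;  // ~ |DR| / K; must match how DR was split into chunks

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String[] parts = key.toString().split(",");
        int i = Integer.parseInt(parts[0]);
        int reducersPerGroup = numPartitions / K;  // assumes numPartitions is a multiple of K

        int group;
        if ("*".equals(parts[1])) {
            // Special key "i,*,g": the mapper emitted K copies of Di's data,
            // one per target group g, so the full document reaches every group.
            group = Integer.parseInt(parts[2]);
        } else {
            // Regular key "i,j": route it to the group holding j's chunk of DR,
            // which is how I read "a reducer in the j/K-th group".
            int j = Integer.parseInt(parts[1]);
            group = j / CHUNK_SIZE;
        }
        // Pick one reducer inside the group; using i keeps all keys for
        // document i on the same reducer within that group (my assumption).
        return group * reducersPerGroup + (i % reducersPerGroup);
    }
}
```

If this picture is wrong, especially the relationship between groups and physical reducer slots, that is exactly what the doubts below are about.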
Doubts:
I have made some assumptions:
Statement: Every key (i,j) is mapped to a reducer in the j/K-th group. My assumption: K reduce nodes are present, and the key is mapped to the j/K-th reduce node.
Doubt: Are some reduce nodes grouped together? Say, are nodes 0, 1, and 2 grouped as Group 0?
Statement: The document's data are replicated at most K times, so that the full content of the document can be delivered to every reducer.
Does that mean K equals the number of reducer nodes? If not, then aren't we wasting compute nodes by leaving some of them unused?
Main doubt: Is K equal to the number of reducer nodes?
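To make the main doubt concrete, here are the two layouts I am picturing (all numbers are made up by me):

```java
public final class GroupLayoutExample {
    public static void main(String[] args) {
        int numReducers = 12; // total reducer slots configured for the job (made-up number)
        int k = 4;            // the user-defined K

        // Picture 1: K < numReducers, so each group spans numReducers / K reducers:
        // group 0 = reducers {0,1,2}, group 1 = {3,4,5}, group 2 = {6,7,8}, group 3 = {9,10,11}.
        int reducersPerGroup = numReducers / k;
        System.out.println("reducers per group = " + reducersPerGroup); // 3

        // Picture 2: K must equal numReducers, so every "group" is a single reducer,
        // and choosing K smaller than numReducers would leave some reducers idle.
        // Which of these two pictures is the right one?
    }
}
```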
Hoping for responses!
Thanks!