In many real-world scenarios, the cardinality of values in a dataset will be relatively small. In such cases, the problem can be efficiently solved with two MapReduce jobs:
1. Calculate the frequency of each value in your dataset (basically, a Word Count job)
2. Identity mapper + a reducer that calculates the median based on <value, frequency> pairs
Job 1 will drastically reduce the amount of data and can be executed fully in parallel. The reducer of job 2 will only have to process n (n = cardinality of your value set) items instead of all values, as it would with the naive approach.
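Job 1 is essentially a word count over the values. A minimal Hadoop streaming mapper and reducer for it could look like the sketch below; it assumes each input line contains a single value, and the file names are just placeholders:

# job1_mapper.py (hypothetical name): emit <value, 1> for every input record
import sys

for line in sys.stdin:
    value = line.strip()
    if value:
        print(value + "\t1")

# job1_reducer.py (hypothetical name): sum the 1s for each value
# (streaming guarantees that all pairs with the same key arrive consecutively)
import sys

current_value, current_count = None, 0
for line in sys.stdin:
    value, count = line.strip().split("\t", 1)
    if value == current_value:
        current_count += int(count)
    else:
        if current_value is not None:
            print("%s\t%d" % (current_value, current_count))
        current_value, current_count = value, int(count)
if current_value is not None:
    print("%s\t%d" % (current_value, current_count))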
Below is an example reducer for job 2. It is a Python script that could be used directly in Hadoop streaming. It assumes the values in your dataset are ints, but it can easily be adapted for doubles.
import sys

# Store in memory a mapping from each value to the range of (1-based) indexes
# it would occupy in a sorted list of all values.
# Assumes the <value \t count> pairs arrive sorted by value in numeric order.
item_to_index_range = []
total_count = 0

for line in sys.stdin:
    item, count = line.strip().split("\t", 1)
    new_total_count = total_count + int(count)
    item_to_index_range.append((item, (total_count + 1, new_total_count + 1)))
    total_count = new_total_count

# Calculate index(es) of the middle item(s)
middle_items_indexes = [(total_count // 2) + 1]
if total_count % 2 == 0:
    middle_items_indexes += [total_count // 2]

# Retrieve the middle item(s)
middle_items = []
for i in middle_items_indexes:
    for item, index_range in item_to_index_range:
        if i in range(*index_range):
            middle_items.append(int(item))
            break

print(sum(middle_items) / len(middle_items))
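For a quick local sanity check you can pipe a few <value, count> pairs into the script. The file name median_reducer.py and the sample data below are assumptions made up for this example:

import subprocess

# The pairs below encode the sorted values 1,1,1,2,2,5,5,5,5,5 (10 values),
# so the middle items are 2 and 5 and the expected median is 3.5.
sample = "1\t3\n2\t2\n5\t5\n"
result = subprocess.run(
    ["python", "median_reducer.py"],  # hypothetical file name for the reducer above
    input=sample,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # expected: 3.5

In an actual job you would plug the script in via the streaming jar's -reducer option (with an identity mapper such as cat), but the exact invocation depends on your Hadoop setup.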
This answer builds on top of a suggestion initially coming from the answer of Chris White. That answer suggests using a combiner as a means to calculate the frequencies of values. However, in MapReduce, combiners are not guaranteed to always be executed. This has some side effects:
- the reducer will first have to compute the final <value, frequency> pairs and only then calculate the median (sketched below);
- in the worst-case scenario, combiners will never be executed and the reducer will still have to struggle through all individual values.
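In other words, with the combiner-based approach the reducer has to redo the counting that job 1 of this answer performs as a separate step. A rough sketch of that extra aggregation, assuming the same <value \t count> input format and relying on the keys arriving sorted:

import sys
from itertools import groupby

# Combiner execution is not guaranteed, so the same value may appear many times
# (in the worst case, once per original record). Rebuild the final counts first.
pairs = (line.strip().split("\t", 1) for line in sys.stdin)
final_counts = [
    (value, sum(int(count) for _, count in group))
    for value, group in groupby(pairs, key=lambda pair: pair[0])
]
# final_counts now plays the role of the <value, frequency> pairs that the
# median reducer above consumes.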