I have a Pig Streaming job where the number of mappers should equal the number of rows/lines in the input file. I know that setting
set mapred.min.split.size 16
set mapred.max.split.size 16
set pig.noSplitCombination true
will ensure that each input split is 16 bytes. But how do I ensure that each map task gets exactly one line as input? The lines are variable length, so using a constant value for mapred.min.split.size and mapred.max.split.size is not the best solution.
Here is the code I intend to use:
-- "input" is a reserved keyword in Pig Latin, so the alias is named lines instead
lines = load 'hdfs://cluster/tmp/input';
DEFINE CMD `/usr/bin/python script.py`;
OP = stream lines through CMD;
dump OP;
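For reference, script.py could look roughly like the sketch below. It assumes Pig's default streaming serialization, where each tuple arrives on stdin as one tab-delimited line and results are read back from stdout. The process_line helper and its upper-casing behaviour are hypothetical placeholders, not the actual script:

```python
import sys


def process_line(line):
    """Hypothetical per-record work: split the tab-delimited fields and upper-case them."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(field.upper() for field in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Assumption: one record per line on stdin, one result per line on stdout.
    for line in stdin:
        stdout.write(process_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Whatever the real per-line work is, it would replace the body of process_line; the stdin/stdout loop stays the same regardless of how many lines a mapper receives.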
SOLVED! Thanks to zsxwing.
In case anyone else runs into this weird nonsense, know this:
To ensure that Pig creates one mapper for each input file, you must set
set pig.splitCombination false
and not
set pig.noSplitCombination true
Why this is the case, I have no idea!
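Since that setting gives one mapper per input file, one possible way to get one mapper per line is to pre-split the input so that every line sits in its own file. The sketch below is only an illustration under that assumption: it works on the local filesystem, the split_into_line_files name and the part-NNNNN file naming are made up here, and uploading the resulting directory to HDFS is a separate step not shown.

```python
import os


def split_into_line_files(src_path, out_dir):
    """Write each non-empty line of src_path to its own file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(src_path) as src:
        for index, line in enumerate(src):
            if not line.strip():
                continue  # skip blank lines so no mapper gets empty input
            # Hypothetical naming scheme: part-<zero-padded source line index>
            part_path = os.path.join(out_dir, "part-%05d" % index)
            with open(part_path, "w") as part:
                # Make sure every file ends with exactly one newline.
                part.write(line if line.endswith("\n") else line + "\n")
            written.append(part_path)
    return written
```

Keep in mind that a large number of tiny files tends to be inefficient on HDFS, so this is probably only reasonable when the input has a modest number of lines.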