Pig: Force one mapper per input line/row
I have a Pig Streaming job where the number of mappers should equal the number of rows/lines in the input file. I know that setting

set mapred.min.split.size 16 
set mapred.max.split.size 16
set pig.noSplitCombination true 

will ensure that each split is 16 bytes. But how do I ensure that each map task has exactly one line as input? The lines are of variable length, so using a constant value for mapred.min.split.size and mapred.max.split.size is not the best solution.

Here is the code I intend to use:

input = load 'hdfs://cluster/tmp/input';
DEFINE CMD `/usr/bin/python script.py`;
OP = stream input through CMD;
dump OP;

SOLVED! Thanks to zsxwing

And, in case anyone else runs into this weird nonsense, know this:

To ensure that Pig creates one mapper for each input file you must set

set pig.splitCombination false

and not

set pig.noSplitCombination true

Why this is the case, I have no idea!
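Putting the pieces together, here is a sketch of the working script (following the suggestion in the comments to pre-split the input into one file per line; the path and script name are the ones from my question):

```pig
-- with the input pre-split into one file per line (see comments),
-- disabling split combination gives one mapper per file, i.e. per line
set pig.splitCombination false

input = load 'hdfs://cluster/tmp/input';  -- a directory of one-line files
DEFINE CMD `/usr/bin/python script.py`;
OP = stream input through CMD;
dump OP;
```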

Unvoice answered 11/6, 2013 at 22:25 Comment(4)
It's very strange to use one mapper to handle only one line. Why do you have such a strange requirement? – Cavuoto
I'm doing cross-validation for a machine learning job. Each line is a set of parameters. I have anywhere between 10 and 500 lines. What's counter-intuitive is that each line is actually an input into a complicated algorithm, and takes 5-ish minutes of actual compute time. – Unvoice
How about splitting your one input file into many files (each containing only one line) first? You could write a Python UDF to do the splitting. – Cavuoto
Great, thank you! I'll write my solution up above. – Unvoice

Following your clue, I browsed the Pig source code to find the answer.

Setting pig.noSplitCombination in a Pig script doesn't work. In a Pig script, you need to use pig.splitCombination; Pig then sets pig.noSplitCombination in the JobConf according to the value of pig.splitCombination.

If you want to set pig.noSplitCombination directly, you need to use the command line. For example,

pig -Dpig.noSplitCombination=true -f foo.pig

The difference between the two approaches is: the set instruction in a Pig script stores the value in Pig properties, while -D stores it in the Hadoop Configuration.

If you use set pig.noSplitCombination true, then (pig.noSplitCombination, true) is stored in Pig properties. But when Pig initializes a JobConf, it fetches the value of pig.splitCombination from Pig properties, so your setting has no effect (this is visible in the source code). The correct way is set pig.splitCombination false, as you mentioned.

If you use -Dpig.noSplitCombination=true, then (pig.noSplitCombination, true) is stored in the Hadoop Configuration. Since the JobConf is copied from the Configuration, the -D value is passed directly to the JobConf.

Finally, PigInputFormat reads pig.noSplitCombination from the JobConf to decide whether to combine splits (again, see the source code).
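As a toy illustration of the flow just described (plain Python, with made-up names standing in for Pig properties, the Hadoop Configuration, and the JobConf — a sketch of the behavior, not Pig's actual code):

```python
# Toy model: why "set pig.noSplitCombination true" is ignored while
# "set pig.splitCombination false" and "-Dpig.noSplitCombination=true" work.
# All names and the precedence logic are illustrative, not Pig's real classes.

def build_jobconf(pig_properties, hadoop_configuration):
    # JobConf starts as a copy of the Hadoop Configuration,
    # so -D values pass straight through.
    jobconf = dict(hadoop_configuration)
    if "pig.noSplitCombination" not in jobconf:
        # Pig reads pig.splitCombination from its OWN properties (any
        # pig.noSplitCombination key stored there is never consulted)
        # and translates it into pig.noSplitCombination in the JobConf.
        combine = pig_properties.get("pig.splitCombination", "true")
        jobconf["pig.noSplitCombination"] = "true" if combine == "false" else "false"
    return jobconf

# "set pig.noSplitCombination true" lands in Pig properties and is ignored:
in_script_wrong = build_jobconf({"pig.noSplitCombination": "true"}, {})
# "set pig.splitCombination false" is the key Pig actually reads:
in_script_right = build_jobconf({"pig.splitCombination": "false"}, {})
# "-Dpig.noSplitCombination=true" lands in the Hadoop Configuration:
via_command_line = build_jobconf({}, {"pig.noSplitCombination": "true"})

print(in_script_wrong["pig.noSplitCombination"])   # false -- setting ignored
print(in_script_right["pig.noSplitCombination"])   # true
print(via_command_line["pig.noSplitCombination"])  # true
```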

Cavuoto answered 14/6, 2013 at 2:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.