Pig: Control number of mappers
Asked Answered

I can control the number of reducers by using the PARALLEL clause in the statements that result in reducers.

I want to control the number of mappers. The data source is already created, and I cannot reduce the number of part files in the data source. Is it possible to control the number of maps spawned by my Pig statements? Can I keep a lower and an upper cap on the number of maps spawned? Is it a good idea to control this?

I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum, etc., but they did not seem to help.
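
For reference, properties like these are typically set either at the top of the Pig script with the set command or when launching Pig with -D; a minimal sketch (the values below are only placeholders):

    -- per-script properties, applied to the jobs this script generates
    set pig.maxCombinedSplitSize 268435456;   -- 256 MB, in bytes
    set mapred.min.split.size 134217728;      -- 128 MB, in bytes

The same properties can also be passed on the command line, e.g. pig -Dpig.maxCombinedSplitSize=268435456 myscript.pig.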

Can someone please help me understand how to control the number of maps and possibly share a working example?

Magalymagan answered 16/6, 2014 at 7:13 Comment(2)
What is the nature of your data? Size, number of small files per projection? – Kaine
@alexeipab, my input data is a couple of GBs (7 to 8), with about 10 to 20 MB of data per part file. Do these parameters matter? My question was rather generic; I want to understand the different ways to control the number of mappers. – Magalymagan

There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split depends on the block size into which HDFS splits the files (64 MB, 128 MB, or 256 MB, depending on your configuration). Note that FileInputFormats take the block size into account but can define their own splitting behaviour.

Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings the code to the data, not the data to the code.

The problem arises when the size of a file is less than the block size (64 MB, 128 MB, 256 MB): there will be as many splits as there are input files, which is not efficient, as each map task has a startup cost. In this case your best bet is pig.maxCombinedSplitSize, which makes Pig read multiple small files into one mapper, in effect ignoring the splits. But if you make it too large you bring data to the code and run into network issues: forcing too few mappers means data has to be streamed from other data nodes. Keep the value close to the block size, or half of it, and you should be fine.
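
A rough sketch of that approach, assuming a 128 MB block size and a directory full of small part files (the paths and the value are placeholders):

    -- pack multiple small files into each map task, up to ~128 MB per mapper
    set pig.maxCombinedSplitSize 134217728;   -- bytes; keep this close to the block size or half of it

    raw = LOAD '/data/many_small_parts' USING PigStorage('\t');
    -- ... your transformations here ...
    STORE raw INTO '/data/output' USING PigStorage('\t');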

Another solution is to merge the small files into one large splittable file, which will automatically produce an efficient number of mappers.
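
One way to do that merge with Pig itself is to push the data through a reduce phase and use PARALLEL to control how many output files are written; a sketch, assuming an extra sort is acceptable (ORDER forces a reduce stage and sorts the data as a side effect; the paths and the value 8 are placeholders):

    small  = LOAD '/data/many_small_parts' USING PigStorage('\t');
    -- ORDER triggers a reduce stage; PARALLEL 8 yields 8 reducers and therefore 8 larger output files
    merged = ORDER small BY $0 PARALLEL 8;
    STORE merged INTO '/data/merged_parts' USING PigStorage('\t');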

Kaine answered 16/6, 2014 at 10:7 Comment(3)
Thanks for the answer. As I mentioned in the question, I tried setting pig.maxCombinedSplitSize to my block size (512 MB), but that did not change the number of mappers at all. The current number of mappers spawned is approximately 2500. Do I need to set anything else apart from pig.maxCombinedSplitSize? – Magalymagan
Check this one: pig.splitCombination – turns combining of split files on or off (set to “true” by default). – Kaine
Thanks! It helped. I did not need to set pig.splitCombination to true, as true is the default value. I was not trying a large enough value for pig.maxCombinedSplitSize to see the number of mappers go down. As an experiment, I tried setting it to 2 GB and it showed the effect. I will set the value to half the data block size. – Magalymagan
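
Putting the comment thread together, the settings that ended up having the desired effect look roughly like this (268435456 bytes = 256 MB, i.e. half of the 512 MB block size mentioned above):

    set pig.splitCombination true;            -- the default; combining must be on for the size cap to apply
    set pig.maxCombinedSplitSize 268435456;   -- 256 MB, half the block size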

You can change the property mapred.map.tasks to the number you want. This property holds the default number of map tasks per job. Instead of setting it globally, set the property for your session, so the default will be restored once your job is done.
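
A minimal sketch of setting it per session rather than in the cluster-wide configuration (the value 100 is only a placeholder, and Hadoop treats mapred.map.tasks as a hint; the split calculation still decides the actual number):

    -- from the grunt shell or at the top of the script; applies only to this session's jobs
    set mapred.map.tasks 100;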

Weisbart answered 16/6, 2014 at 10:42 Comment(0)
