Setting Hadoop parameters with boto?
I am trying to enable bad input skipping on my Amazon Elastic MapReduce jobs. I am following the wonderful recipe described here:

http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code

The link above says that I need to somehow set the following configuration parameters on an EMR job:

mapred.skip.mode.enabled=true
mapred.skip.map.max.skip.records=1
mapred.skip.attempts.to.start.skipping=2
mapred.map.tasks=1000
mapred.map.max.attempts=10

How do I set these (and other) mapred.XXX parameters on a JobFlow using Boto?

Sec asked 22/8, 2012 at 10:48

After many hours of struggling, reading code, and experimentation, here is the answer:

You need to add a new BootstrapAction, like so:

from boto.emr.connection import EmrConnection
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.step import StreamingStep

# Each '-s key=value' pair tells the configure-hadoop bootstrap action to set
# that property in the cluster's Hadoop site configuration.
params = ['-s', 'mapred.skip.mode.enabled=true',
          '-s', 'mapred.skip.map.max.skip.records=1',
          '-s', 'mapred.skip.attempts.to.start.skipping=2',
          '-s', 'mapred.map.max.attempts=5',
          '-s', 'mapred.task.timeout=100000']
config_bootstrapper = BootstrapAction('Enable skip mode',
                                      's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
                                      params)

conn = EmrConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
step = StreamingStep(name='My Step', ...)
conn.run_jobflow(..., bootstrap_actions=[config_bootstrapper], steps=[step], ...)

Of course, if you have more than one bootstrap action, just add them all to the bootstrap_actions list, as sketched below.
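
For instance, a minimal sketch with a second, purely hypothetical action (the action name and S3 script path below are made up for illustration):

# Hypothetical second bootstrap action pointing at your own script in S3
extra_bootstrapper = BootstrapAction('Install dependencies',
                                     's3://mybucket/bootstrap/install-deps.sh',
                                     [])
conn.run_jobflow(..., bootstrap_actions=[config_bootstrapper, extra_bootstrapper],
                 steps=[step], ...)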

Sec answered 22/8, 2012 at 10:48
Thanks! That worked for me. Passing the same parameters as ['-D', '...'] arguments on the step itself sometimes works, but adding this bootstrap action seems to make it bullet-proof. – Beware
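
For reference, a rough sketch of the "-D on the step" alternative the comment mentions, using StreamingStep's step_args. The S3 paths here are hypothetical, and whether these options land where Hadoop streaming expects them can depend on the boto and AMI versions, which is why the bootstrap action above is the more reliable route:

# Same properties passed as generic '-D' options on the streaming step itself
step = StreamingStep(name='My Step',
                     mapper='s3://mybucket/scripts/mapper.py',    # hypothetical
                     reducer='s3://mybucket/scripts/reducer.py',  # hypothetical
                     input='s3://mybucket/input/',
                     output='s3://mybucket/output/',
                     step_args=['-D', 'mapred.skip.mode.enabled=true',
                                '-D', 'mapred.skip.map.max.skip.records=1'])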
