How to specify mapred configurations & Java options with a custom JAR in the CLI using Amazon's EMR?

I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom JAR.

We can specify these configurations in the following way when running with external scripting languages like Ruby or Python:

ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output

I tried the following ways, but none of them worked:

  1. ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

  2. ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -jobconf,mapred.min.split.size=52880 -jobconf,mapred.task.timeout=0

I would also like to know how to pass Java options to a streaming job using a custom JAR in EMR. When running locally on Hadoop, we can pass them as follows:

bin/hadoop jar job.jar input_path output_path -D<some_java_parameter>=<some_value>
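
For context, a property passed with -D this way ends up in the job Configuration, so code inside the JAR can read it as well. A minimal sketch of the reading side (the property name my.custom.threshold is only an illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: reads a value supplied on the command line as
// -D my.custom.threshold=42 (the property name is just an example).
public class ThresholdMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long threshold;

  @Override
  protected void setup(Context context) {
    // Anything passed as -D key=value is available through the job Configuration.
    threshold = context.getConfiguration().getLong("my.custom.threshold", 0L);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Example use of the configured value: only emit sufficiently long lines.
    if (value.getLength() > threshold) {
      context.write(value, key);
    }
  }
}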

Litotes answered 14/2, 2012 at 20:45 Comment(2)
I'm looking into this myself at the moment. Did you come up with anything?Subarctic
@MichaelDellaBitta: I've just provided an answer, which might be useful, depending on what you need to achieve in particular.Blinker

I believe if you want to set these on a per-job basis, then you need to:

A) For custom JARs, pass the settings into your JAR as arguments and process them yourself. I believe this can be automated as follows:

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // GenericOptionsParser applies generic options such as -D key=value to conf
  // and returns only the remaining application-specific arguments.
  args = new GenericOptionsParser(conf, args).getRemainingArgs();
  //....
}

Then create the job in this manner (I haven't verified whether this works, though):

 > elastic-mapreduce --jar s3://mybucket/mycode.jar \
    --args "-D,mapred.reduce.tasks=0"
    --arg s3://mybucket/input \
    --arg s3://mybucket/output

The GenericOptionsParser should automatically transfer the -D parameters (along with the other generic options such as -conf, -files and -libjars) into Hadoop's job setup. More details: http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/util/GenericOptionsParser.html
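
For reference, this pattern is usually wired up through ToolRunner, which runs GenericOptionsParser for you and then hands only the remaining arguments to the job. A sketch along those lines (the class name MyJobDriver and the two-argument input/output layout are illustrative, not something the EMR CLI requires):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch only: a driver that picks up -D options via ToolRunner.
public class MyJobDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value settings, because
    // ToolRunner applied GenericOptionsParser before calling run().
    Job job = new Job(getConf(), "my-emr-job");
    job.setJarByClass(MyJobDriver.class);
    // args now holds only the non-generic arguments, e.g. input and output paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ...set mapper, reducer and key/value classes here...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}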

B) For the Hadoop streaming jar, you also just pass the configuration change on the command line:

> elastic-mapreduce --jobflow j-ABABABABA \
   --stream --jobconf mapred.task.timeout=600000 \
   --mapper s3://mybucket/mymapper.sh \
   --reducer s3://mybucket/myreducer.sh \
   --input s3://mybucket/input \
   --output s3://mybucket/output \
   --jobconf mapred.reduce.tasks=0

More details: https://forums.aws.amazon.com/thread.jspa?threadID=43872 and elastic-mapreduce --help

Kylie answered 31/10, 2012 at 20:40 Comment(3)
Yeah this is what we have been doing (using Apache CLI). But is there something similar to how we can pass java options when running hadoop locally as I have mentioned in the last line of my question?Litotes
I edited my answer to clarify what I meant. If it still sounds like this is what you're doing, then I must be missing something. Let me know :)Kylie
We are in effect doing what GenericOptionsParser would be doing internally... but we will be using GenericOptionsParser from now on. Thanks.Litotes

In the context of Amazon Elastic MapReduce (Amazon EMR), you are looking for Bootstrap Actions:

Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. [emphasis mine]

Section Running Custom Bootstrap Actions from the CLI provides a generic usage example:

$ ./elastic-mapreduce --create --stream --alive \
--input s3n://elasticmapreduce/samples/wordcount/input \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--output s3n://myawsbucket \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh

In particular, there are separate bootstrap actions to configure Hadoop and Java:

Hadoop (cluster)

You can specify Hadoop settings via bootstrap action Configure Hadoop, which allows you to set cluster-wide Hadoop settings, for example:

$ ./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.task.timeout=0"     

Java (JVM)

You can specify custom JVM settings via bootstrap action Configure Daemons:

This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

The provided example sets the namenode heap size to 2048 MB and configures a JVM option for the namenode:

$ ./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
  --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
Blinker answered 5/4, 2012 at 9:42 Comment(1)
We do this, but this is not what I was looking for. Say you have 10 jobs to be run as streaming jobs one by one and you want different settings for different jobs; how do you go about doing that? And as stated in my question, we can do this while using Ruby/Python as mapper/reducer, but we are unable to do that when using a custom JAR. For now we are passing the settings as normal args and setting them in Java code.Litotes
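
For completeness, the manual approach mentioned in the comment above might look roughly like this (a sketch with a made-up argument layout; with GenericOptionsParser, Hadoop does this parsing for you):

import org.apache.hadoop.conf.Configuration;

// Sketch only: hypothetical layout <input> <output> key=value [key=value ...]
public class ManualConfArgs {

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // args[0] and args[1] would be the input and output paths;
    // everything after that is treated as a key=value setting.
    for (int i = 2; i < args.length; i++) {
      String[] kv = args[i].split("=", 2);   // e.g. "mapred.task.timeout=0"
      if (kv.length == 2) {
        conf.set(kv[0], kv[1]);              // same effect as -D key=value
      }
    }
    // ...build and submit the job with conf and the two paths...
  }
}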