I would like to know how to specify mapreduce configurations such as mapred.task.timeout , mapred.min.split.size etc. , when running a streaming job using custom jar.
We can use the following way to specify these configurations when we run using external scripting languages like ruby or python:
ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3:somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output
I tried the following ways, but none of them worked:
ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0
ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -jobconf,mapred.min.split.size=52880 -jobconf,mapred.task.timeout=0
I would also like to know how to pass java options to a streaming job using custom jar in EMR. When running locally on hadoop we can pass it as follows:
bin/hadoop jar job.jar input_path output_path -D< some_java_parameter >=< some_value >