Copy files from Amazon S3 to HDFS using S3DistCp fails

I am trying to copy files from S3 to HDFS using a workflow in EMR. When I run the command below, the job flow starts successfully, but it gives me an error when it tries to copy the files to HDFS. Do I need to set any input file permissions?

Command:

./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users'

Output:

Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to copy 1 files: s3://odsh/input/GL_01112_20121019.dat etc
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:538)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Coacervate answered 31/1, 2013 at 17:0 Comment(0)

I'm getting the same exception. It looks like the bug is caused by a race condition when CopyFilesReducer uses multiple CopyFilesRunable instances to download the files from S3. The problem is that it uses the same temp directory in multiple threads, and the threads delete the temp directory when they're done. Hence, when one thread completes before another, it deletes the temp directory that the other thread is still using.
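
To make the failure mode concrete, here is a minimal, hypothetical Java sketch (not the actual CopyFilesReducer code): two workers share one temp directory, and whichever finishes first deletes it, so the slower worker fails when it tries to write its file.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the real s3distcp source: illustrates the shared
// temp directory race described above.
public class SharedTempDirRace {

    static void copyFile(Path sharedTmp, String name, long workMillis) {
        try {
            Thread.sleep(workMillis);                                   // simulate the S3 download taking time
            Files.write(sharedTmp.resolve(name), new byte[]{1, 2, 3});  // fails if the shared dir is already gone
            System.out.println("copied " + name);
        } catch (IOException | InterruptedException e) {
            System.err.println(name + " failed: " + e);                 // analogous to "Reducer task failed to copy"
        } finally {
            deleteDir(sharedTmp.toFile());                              // each worker deletes the *shared* temp dir
        }
    }

    static void deleteDir(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {
            for (File f : children) {
                f.delete();
            }
        }
        dir.delete();
    }

    public static void main(String[] args) throws Exception {
        Path sharedTmp = Files.createTempDirectory("s3distcp-tmp");
        Thread fast = new Thread(() -> copyFile(sharedTmp, "part-0", 10));
        Thread slow = new Thread(() -> copyFile(sharedTmp, "part-1", 500));
        fast.start();
        slow.start();
        fast.join();
        slow.join();
    }
}

The fast worker finishes and removes the shared directory, and the slow worker then fails with a NoSuchFileException, which is the same pattern as one reducer thread deleting the temp directory another thread is still using.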

I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread by setting the variable s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.
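
For example (a sketch only, reusing the jobflow, source, and destination from the question, and assuming s3distcp accepts the property as a Hadoop-style -D option placed right after the jar in the --args list):

./elastic-mapreduce --jobflow j-35D6JOYEDCELA \
    --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
    --args '-Ds3DistCp.copyfiles.mapper.numWorkers=1,--src,s3://odsh/input/,--dest,hdfs:///Users'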

Lindsaylindsey answered 4/11, 2013 at 22:11 Comment(3)
Does anyone know if this is fixed by Amazon? – Purusha
Note: Make sure you put a space between the -D and the s3, and you also have to put this right after the foobar.jar part. – Acyclic
After all these years it's still so relevant. Wonder why it's still not fixed. – Transubstantiation

I saw this same problem, caused by the race condition. Passing -Ds3DistCp.copyfiles.mapper.numWorkers=1 helps avoid the problem.
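
A possible hadoop jar form, assuming the emr-s3distcp jar path shown in the next answer and the source/destination from the question (per the comment below, the -D option goes right after the jar, separated by a space):

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -D s3DistCp.copyfiles.mapper.numWorkers=1 \
    --src s3://odsh/input/ \
    --dest hdfs:///Users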

I hope Amazon fixes this bug.

Goggleeyed answered 12/7, 2014 at 22:17 Comment(1)
Note: Make sure you put a space between the -D and the s3, and you also have to put this right after the foobar.jar part. – Acyclic

Adjusting the number of workers didn't work for me; s3distcp always failed on a small/medium instance. Increasing the heap size of the task JVM (via -D mapred.child.java.opts=-Xmx1024m) solved it for me.

Example usage:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -D mapred.child.java.opts=-Xmx1024m \
    --src s3://source/ \
    --dest hdfs:///dest/ --targetSize 128 \
    --groupBy '.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*' \
    --outputCodec gzip
Phelips answered 2/10, 2014 at 4:44 Comment(2)
Note: Make sure you put the -D command right after the foobar.jar command. – Acyclic
In emr-5.29.0 it is in /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar, and not in the locations given above. – Superannuation

The problem is that the map-reduce job fails: the mappers execute perfectly, but the reducers create a bottleneck in the cluster's memory.

Setting -Dmapreduce.job.reduces=30 solved this for me. If it still fails, try reducing it to 20, i.e. -Dmapreduce.job.reduces=20.

I'll add the full set of arguments for ease of understanding:

In AWS Cluster:

JAR location : command-runner.jar

Main class : None

Arguments : s3-dist-cp -Dmapreduce.job.reduces=30 --src=hdfs:///user/ec2-user/riskmodel-output --dest=s3://dev-quant-risk-model/2019_03_30_SOM_EZ_23Factors_Constrained_CSR_Stats/output --multipartUploadChunkSize=1000

Action on failure: Continue

In a script file:

aws --profile $AWS_PROFILE emr add-steps --cluster-id $CLUSTER_ID --steps Type=CUSTOM_JAR,Jar='command-runner.jar',Name="Copy Model Output To S3",ActionOnFailure=CONTINUE,Args=[s3-dist-cp,-Dmapreduce.job.reduces=20,--src=$OUTPUT_BUCKET,--dest=$S3_OUTPUT_LARGEBUCKET,--multipartUploadChunkSize=1000]

Chaps answered 1/4, 2019 at 8:6 Comment(1)
I found this article that is related to this solution and can help explain why the reduces configuration works: oakgreen.blogspot.com/2015/05/… – Staceystaci
