I am struggling to find a way to use S3DistCp in my AWS EMR Cluster.
Some old examples that show how to add S3DistCp as an EMR step use the elastic-mapreduce command, which is not used anymore. Other sources suggest using the s3-dist-cp command, which is not found on current EMR clusters. Even the official documentation (online and the 2016 EMR developer guide PDF) presents an example like this:
aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]
But there is no lib folder in the /home/hadoop path. I found some Hadoop libraries in /usr/lib/hadoop/lib, but I cannot find s3distcp anywhere.
Then I found that some libraries are available in certain S3 buckets. For example, from this question, I found the path s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar. This seemed like a step in the right direction: adding a new step to a running EMR cluster from the AWS interface with these parameters actually started the step (which earlier attempts did not), but it failed after about 15 seconds:
JAR location: s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar
Main class: None
Arguments: --s3Endpoint s3-eu-west-1.amazonaws.com --src s3://source-bucket/scripts/ --dest hdfs:///output
Action on failure: Continue
This resulted in the following error:
Exception in thread "main" java.lang.RuntimeException: Unable to retrieve Hadoop configuration for key fs.s3n.awsAccessKeyId
at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.getConfigOrThrow(ConfigurationCredentials.java:29)
at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.<init>(ConfigurationCredentials.java:35)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileListS3(S3DistCp.java:85)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileList(S3DistCp.java:60)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:529)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I thought this might have been caused by a mismatch between my S3 location (eu-west-1, the same region as the endpoint) and the location of the s3distcp jar, which was in us-east-1. I replaced it with the eu-west-1 equivalent and still got the same authentication error. I have used a similar setup to run my Scala scripts (a Custom JAR step with "command-runner.jar" and "spark-submit" as the first argument) to run a Spark job, and that works; I have not had this authentication problem before.
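For reference, the Spark step that does work is configured roughly like this (the class name and jar path here are placeholders):
JAR location: command-runner.jar
Main class: None
Arguments: spark-submit --class com.example.MyJob s3://my-bucket/jars/my-job.jar
Action on failure: Continue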
What is the simplest way to copy a file from S3 to an EMR cluster? Either by adding an additional EMR step with the AWS SDK (for Go), or somehow inside the Scala Spark script? Or from the AWS EMR interface, but not from the CLI, as I need this to be automated.
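To make the first option concrete, here is a minimal sketch of what I have in mind with the AWS SDK for Go (v1): add a step that runs a copy command on the master node through command-runner.jar, the same mechanism that works for my spark-submit step. The cluster ID is a placeholder, and the s3-dist-cp command and its arguments are only stand-ins for whatever copy command is actually available on the cluster (as noted above, s3-dist-cp did not seem to be present on mine):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/emr"
)

func main() {
	// eu-west-1 matches the region I am using; credentials come from the
	// default chain (environment, shared config, or instance profile).
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("eu-west-1"),
	}))
	svc := emr.New(sess)

	// Add a single step that runs a copy command via command-runner.jar.
	// The cluster ID is a placeholder; the source/destination paths mirror
	// the ones I tried above.
	out, err := svc.AddJobFlowSteps(&emr.AddJobFlowStepsInput{
		JobFlowId: aws.String("j-XXXXXXXXXXXXX"),
		Steps: []*emr.StepConfig{
			{
				Name:            aws.String("Copy from S3 to HDFS"),
				ActionOnFailure: aws.String("CONTINUE"),
				HadoopJarStep: &emr.HadoopJarStepConfig{
					Jar: aws.String("command-runner.jar"),
					Args: aws.StringSlice([]string{
						"s3-dist-cp", // stand-in for whatever copy command exists on the cluster
						"--src", "s3://source-bucket/scripts/",
						"--dest", "hdfs:///output",
					}),
				},
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("added step IDs:", aws.StringValueSlice(out.StepIds))
}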
The s3-dist-cp command was already there automatically on my EMR 6.11 cluster with Hadoop (and Spark) installed, so maybe they brought it back, or maybe you need to select Hadoop as an application to include on the cluster - IDK. FWIW, my issue was forgetting to put /usr/hadoop (I forget the exact spelling now) in my HDFS file path. – Towandatoward