Using s3distcp with Amazon EMR to copy a single file

I want to copy just a single file to HDFS using s3distcp. I have tried the srcPattern argument, but it didn't help; it keeps throwing a java.lang.RuntimeException. It is possible that the regex I am using is the culprit; please help.

My code is as follows:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '(filename)'

Exception thrown:

Exception in thread "main" java.lang.RuntimeException: Error running job
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/a088f00d-a67e-4239-bb0d-32b3a6ef0105/files
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)
    ... 9 more
Vander answered 21/11, 2012 at 13:38 Comment(2)
Whoever downvoted this, may I know the reason? - Vander
What if you have many 15 GB files at a given location in S3, but your job needs only one of them, and you want that file in your local HDFS via s3distcp? - Vander

The regex I was using was indeed the culprit. Say the file names contain dates, for example abcd-2013-06-12.gz; then to copy ONLY that file, the following EMR command should do:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '.*2013-06-12.gz'

If I remember correctly, my regex initially was *2013-06-12.gz and not .*2013-06-12.gz, so the dot at the beginning was needed: srcPattern is a regular expression, and a bare * is a quantifier with nothing to repeat, while .* matches any run of characters, including the path prefix of the key.
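
To see why the leading dot matters: s3distcp is written in Java, so srcPattern follows java.util.regex semantics, and the pattern appears to be matched against the full key path, hence the .* prefix. A minimal, self-contained sketch with a hypothetical key name:

import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SrcPatternDemo {
    public static void main(String[] args) {
        String key = "path/abcd-2013-06-12.gz";
        // ".*" absorbs the leading path, so the whole key matches.
        System.out.println(key.matches(".*2013-06-12.gz"));  // true
        try {
            // A bare "*" has nothing to repeat, so Java rejects it outright.
            Pattern.compile("*2013-06-12.gz");
        } catch (PatternSyntaxException e) {
            System.out.println("Invalid pattern: " + e.getMessage());
        }
    }
}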

Vander answered 18/6, 2013 at 18:12 Comment(0)

DistCp is intended to copy many files using many machines. It is not the right tool if you want to copy only one file.

On the Hadoop master node, you can copy a single file with:

hadoop fs -cp s3://<mybucket>/<path> hdfs:///output
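
For completeness, the same single-file copy can also be done programmatically through the Hadoop FileSystem API. This is only a minimal sketch, assuming Hadoop 1.x-era classes (consistent with the stack trace above), a hypothetical bucket and file name, and S3 credentials already present in the cluster configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SingleFileCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical locations; substitute your own bucket and path.
        Path src = new Path("s3n://mybucket/path/abcd-2013-06-12.gz");
        Path dst = new Path("hdfs:///output/abcd-2013-06-12.gz");
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        // deleteSource=false: keep the S3 object after copying.
        FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
    }
}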

Paperback answered 10/6, 2013 at 3:32 Comment(2)
Thanks. Though it might not be intended, you certainly can copy a single file with s3distcp. Consider an automated pipeline where a cluster is launched and steps are added; there, s3distcp comes in handy. Also, say I have a SINGLE 20 GB gzip file, which would amount to a single mapper running for hours (around 10 hours in our case). Used with s3distcp's '--outputCodec none' option, it not only copies the file to HDFS but also decompresses it, allowing Hadoop to create input splits, thus letting us use more than one mapper (time reduced to 2 hours); see the sketch after these comments. - Vander
I should add that s3distcp does not work when I try to copy a single file from S3 directly; I have to specify a prefix and then a pattern to get the file I need. Not obvious from the documentation at all. - Roosevelt
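
For reference, a sketch of that decompress-while-copying invocation, in the same style as the commands above. The bucket, path, and pattern are placeholders; --outputCodec is a documented s3distcp option, and the value none writes the output uncompressed so Hadoop can split it:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --args '--srcPattern,.*2013-06-12.gz' --args '--outputCodec,none'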