Using s3distcp with Amazon EMR to copy a single file

I want to copy just a single file to HDFS using s3distcp. I have tried the srcPattern argument, but it didn't help; it keeps throwing a java.lang.RuntimeException. It is possible that the regex I am using is the culprit; please help.

My code is as follows:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '(filename)'

Exception thrown:

Exception in thread "main" java.lang.RuntimeException: Error running job
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/a088f00d-a67e-4239-bb0d-32b3a6ef0105/files
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)
    ... 9 more
Vander answered 21/11, 2012 at 13:38 Comment(2)
Whoever downvoted this, may I know the reason? - Vander
What if you have many 15 GB files at a given location in S3, but your job needs only one of them, and you want that file in your local HDFS via s3distcp? - Vander

The regex I was using was indeed the culprit. Say the file names contain dates, for example abcd-2013-06-12.gz; then to copy ONLY that file, the following EMR command should do:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '.*2013-06-12.gz'

If I remember correctly, my regex initially was *2013-06-12.gz and not .*2013-06-12.gz, so the dot at the beginning was needed: srcPattern is a regular expression, and a bare * is a quantifier with nothing to repeat, while .* matches any run of characters, including the path prefix of the key.
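
To see why the leading dot matters: s3distcp is written in Java, so srcPattern follows java.util.regex semantics, and the pattern appears to be matched against the full key path, hence the .* prefix. A minimal, self-contained sketch with a hypothetical key name:

import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SrcPatternDemo {
    public static void main(String[] args) {
        String key = "path/abcd-2013-06-12.gz";
        // ".*" absorbs the leading path, so the whole key matches.
        System.out.println(key.matches(".*2013-06-12.gz"));  // true
        try {
            // A bare "*" has nothing to repeat, so Java rejects it outright.
            Pattern.compile("*2013-06-12.gz");
        } catch (PatternSyntaxException e) {
            System.out.println("Invalid pattern: " + e.getMessage());
        }
    }
}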

Vander answered 18/6, 2013 at 18:12 Comment(0)

DistCp is intended to copy many files using many machines. It is not the right tool if you want to copy only one file.

On the Hadoop master node, you can copy a single file with:

hadoop fs -cp s3://<mybucket>/<path> hdfs:///output
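
For completeness, the same single-file copy can also be done programmatically through the Hadoop FileSystem API. This is only a minimal sketch, assuming Hadoop 1.x-era classes (consistent with the stack trace above), a hypothetical bucket and file name, and S3 credentials already present in the cluster configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SingleFileCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical locations; substitute your own bucket and path.
        Path src = new Path("s3n://mybucket/path/abcd-2013-06-12.gz");
        Path dst = new Path("hdfs:///output/abcd-2013-06-12.gz");
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        // deleteSource=false: keep the S3 object after copying.
        FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
    }
}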

Paperback answered 10/6, 2013 at 3:32 Comment(2)
Thanks. Though it might not be intended, you certainly can copy a single file with s3distcp. Consider an automated pipeline where a cluster is launched and steps are added; there, s3distcp comes in handy. Also, say I have a SINGLE 20 GB gzip file, which would amount to a single mapper running for hours (around 10 hours in our case). Used with s3distcp's '--outputCodec none' option, it not only copies the file to HDFS but also decompresses it, allowing Hadoop to create input splits, thus letting us use more than one mapper (time reduced to 2 hours); see the sketch after these comments. - Vander
I should add that s3distcp does not work when I try to copy a single file from S3 directly; I have to specify a prefix and then a pattern to get the file I need. Not obvious from the documentation at all. - Roosevelt
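
For reference, a sketch of that decompress-while-copying invocation, in the same style as the commands above. The bucket, path, and pattern are placeholders; --outputCodec is a documented s3distcp option, and the value none writes the output uncompressed so Hadoop can split it:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --args '--srcPattern,.*2013-06-12.gz' --args '--outputCodec,none'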