Getting data in and out of Elastic MapReduce HDFS

Asked 9/10, 2011 at 5:42 Answered 16/4, 2014 at 16:28

I've written a Hadoop program which requires a certain layout within HDFS, and which afterwards, I need to get the files out of HDFS. It works on my single-node Hadoop setup and I'm eager to get it working on 10's of nodes within Elastic MapReduce.

What I've been doing is something like this:

./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX

This is asynchronous, but when the job's completed, I can do this

./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate

So while this sort-of works, but it's clunky and not what I'd like. Is there cleaner way to do this?

Thanks!

Backhand answered 9/10, 2011 at 5:42 Comment(0)

You can use distcp which will copy the files as a mapreduce job

# download from s3
$ hadoop distcp s3://bucket/path/on/s3/ /target/path/on/hdfs/
# upload to s3
$ hadoop distcp /source/path/on/hdfs/ s3://bucket/path/on/s3/

This makes use of your entire cluster to copy in parallel from s3.

(note: the trailing slashes on each path are important to copy from directory to directory)

Brahmanism answered 12/10, 2011 at 18:18 Comment(1)

What I ended up doing was using EMR's ability to run a script stored on S3: elastic-mapreduce -j $JOB_ID --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --args "${S3_BUCKET}/scripts/copy_to_hdfs.sh,..." Where the copy_to_hdfs.sh is a bash script which does: hadoop fs -cp [...] Control flow still seems a bit convoluted, but it's able to run unattended. – Backhand 19/10, 2011 at 23:47

@mat-kelcey, does the command distcp expect the files in S3 to have a minimum permission level? For some reason I have to set permission levels of the files to "Open/Download" and "View Permissions" for "Everyone", for the files to be able accessible from within the bootstrap or the step scripts.

Haunt answered 16/4, 2014 at 16:28 Comment(0)

Recommended topics

Hot tags