How to read a file from s3 in EMR?

I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option.

I have tried two solutions:

  • org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException.
  • com.amazonaws.services.s3.AmazonS3Client: throws an exception saying "Access denied".

What I fail to grasp is that I am starting the job from the console, so I should obviously have the necessary permissions. However, the AWS_*_KEY variables are missing from the environment (System.getenv()) available to the mapper.

I am sure I am doing something wrong; I am just not sure what.

Gyatt asked 12/6, 2014 at 12:43 Comment(0)

I think your EMR cluster needs to have access to S3. You can create an IAM role for your EMR cluster and give it access to S3; check this link: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
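
With the role attached, the AWS SDK finds the instance-profile credentials on its own, so no keys need to appear in the code or in the jar. A minimal sketch of reading an object this way (the bucket and key names are hypothetical):

  import com.amazonaws.services.s3.AmazonS3Client;
  import com.amazonaws.services.s3.model.S3Object;
  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  public class S3RoleReadExample {
      public static void main(String[] args) throws Exception {
          // The no-arg constructor uses the default credentials provider chain,
          // which picks up the instance-profile credentials of the EMR role,
          // so no access keys appear anywhere in the code.
          AmazonS3Client s3 = new AmazonS3Client();
          S3Object obj = s3.getObject("my-bucket", "path/to/input.txt");
          try (BufferedReader reader = new BufferedReader(
                  new InputStreamReader(obj.getObjectContent()))) {
              System.out.println(reader.readLine()); // print the first line
          }
      }
  }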

Franctireur answered 19/6, 2014 at 13:24 Comment(1)
This was the right way to go. Without roles, the only solution is to write the access keys directly into the code (or a file in the jar, etc.). Using roles worked without danger of exposing the credentials. – Gyatt

Probably a little bit late, but... use InstanceProfileCredentialsProvider with AmazonS3Client.
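
A minimal sketch of what that looks like (the bucket and key names are hypothetical):

  import com.amazonaws.auth.InstanceProfileCredentialsProvider;
  import com.amazonaws.services.s3.AmazonS3Client;
  import com.amazonaws.services.s3.model.S3Object;

  public class S3InstanceProfileExample {
      public static void main(String[] args) {
          // InstanceProfileCredentialsProvider fetches temporary credentials
          // from the EC2 instance metadata service, i.e. from the IAM role
          // attached to the EMR cluster's instances.
          AmazonS3Client s3 =
              new AmazonS3Client(new InstanceProfileCredentialsProvider());
          S3Object obj = s3.getObject("my-bucket", "path/to/input.txt");
          System.out.println("Content length: "
              + obj.getObjectMetadata().getContentLength());
      }
  }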

Heisel answered 22/5, 2016 at 21:53 Comment(2)
Why this got a downvote is beyond me; very helpful. – Equal
This is life-saving. – Gudrunguelderrose

I think the syntax is

hadoop jar your.jar com.your.main.Class -Dfs.s3n.awsAccessKeyId=<access-id> -Dfs.s3n.awsSecretAccessKey=<secret-key>

Then the path to the common prefix you wish to read should be of the form

s3n://bucket-name/common/prefix/path
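
Note that the -D generic options only take effect if the main class runs them through GenericOptionsParser, e.g. via ToolRunner. A minimal sketch of such a driver (the class and file names are hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;
  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  public class S3nReadExample extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
          // getConf() already contains fs.s3n.awsAccessKeyId and
          // fs.s3n.awsSecretAccessKey when they were passed with -D.
          Path path = new Path("s3n://bucket-name/common/prefix/path/part-00000");
          FileSystem fs = path.getFileSystem(getConf());
          try (BufferedReader reader = new BufferedReader(
                  new InputStreamReader(fs.open(path)))) {
              System.out.println(reader.readLine()); // read the first line
          }
          return 0;
      }

      public static void main(String[] args) throws Exception {
          System.exit(ToolRunner.run(new Configuration(), new S3nReadExample(), args));
      }
  }
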
Gender answered 13/6, 2014 at 9:21 Comment(3)
I am running the JAR on EMR. I do not have a hadoop command there, as far as I know. – Gyatt
EMR sucks ... get a devops to build you a proper EC2 cluster :) @DavidNemeskey – Gender
The hadoop command is present on EMR instances (at least as of version 5.3.0). – Bettyannbettye
