(python) Spark .textFile(s3://...) access denied 403 with valid credentials

In order to access my S3 bucket I have exported my credentials:

export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESS_KEY_ID=

I can verify that everything works by doing

aws s3 ls mybucket

I can also verify with boto3 that it works in Python:

import boto3

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
        .put(Body=open("text.py", "rb"), ContentType="text/x-py")

This works and I can see the file in the bucket.

However, when I do this with Spark:

from pyspark import SparkContext
from pyspark.sql import SQLContext

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")

I get a nice

> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0"
> encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>

This is how I submit the job locally:

spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py

Why does it work with the command line + boto3, but Spark is choking?

EDIT:

Same issue using s3a:// with

hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

and the same issue using aws-sdk 1.7.4 and hadoop 2.7.2.

Azeotrope answered 7/3, 2017 at 14:47 Comment(5)
Read this? cloudera.com/documentation/enterprise/latest/topics/… – Allix
I think it should also work by exporting AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY. Is creating that credentials file really a necessity? As you can see, Spark correctly picks up the AWS_ACCESS_KEY from the env variable but for some reason fails to authenticate. – Azeotrope
Spark is distributed. Just because you have the ENV variables on one executor doesn't mean the other executors have them as well. You should use SparkConf to set the values (see the sketch just after these comments). – Allix
Getting the same error with hadoopConf = spark_context._jsc.hadoopConfiguration(); hadoopConf.set("fs.s3.awsAccessKeyId", "xxxxx"); hadoopConf.set("fs.s3.awsSecretAccessKey", "xxxxxx"), and the same error when setting them with SparkConf. – Azeotrope
@cricket_007 I think I found the issue; I have created a new post for it: #42669746 – Azeotrope
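
For reference, a minimal sketch of the SparkConf approach suggested in the comments above; the app name and key values are placeholders, not the asker's actual configuration:

from pyspark import SparkConf, SparkContext

# Placeholder credentials: the spark.hadoop.* prefix tells Spark to copy
# these options into the Hadoop configuration, so the S3A connector sees
# them on the driver and on every executor.
conf = (SparkConf()
        .setAppName("s3a-credentials-sketch")
        .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
        .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY"))

spark_context = SparkContext(conf=conf)
lines = spark_context.textFile("s3a://mybucket/my/path/*")
print(lines.count())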

Spark will automatically copy your AWS credentials to the s3n and s3a secrets. Apache Spark releases don't touch s3:// URLs, because in Apache Hadoop the s3:// schema is associated with the original, now-deprecated s3 client, one which is incompatible with everything else.

On Amazon EMR, s3:// is bound to Amazon's own EMR S3 client, and the EC2 VMs provide the secrets to the executors automatically, so I don't think it bothers with the env-var propagation mechanism. It might also be that, given how it sets up the authentication chain, you can't override the EC2/IAM data.

If you are trying to talk to S3 and you are not running in an EMR VM, then presumably you are using Apache Spark with the Apache Hadoop JARs, not the EMR versions. In that world, use URLs with s3a:// to get the latest S3 client library.
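
As a rough sketch of what that looks like from PySpark, reusing the bucket path from the question and assuming hadoop-aws is on the classpath (the app name is a placeholder):

from pyspark import SparkContext

# Submitted, for example, with:
#   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 test.py
# and relying on Spark copying the exported AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY into the fs.s3a.* secrets, as described above.
spark_context = SparkContext(appName="s3a-read-sketch")
lines = spark_context.textFile("s3a://mybucket/my/path/*")
print(lines.take(5))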

If that doesn't work, look at the troubleshooting section of the Apache docs. There's a section on "403" there, including recommended steps for troubleshooting. It can be due to classpath/JVM version problems as well as credential issues, or even clock skew between the client and AWS.

Justificatory answered 7/3, 2017 at 16:6 Comment(7)
Sorry, I forgot to mention it in the original post, but I'm not running this on EMR (it works there); I'm running it locally. I will try with s3a:// – Azeotrope
Same thing, but the error is slightly different: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXX, AWS Error Code: null, AWS Error Message: Forbidden – Azeotrope
OK, well, S3A is the ASF code. As well as an auth problem, it can suffer from an incompatibility between joda-time and the JVM version. I'll edit my answer with a link to the relevant docs. – Justificatory
Thanks for the link. My clock is at the right time, I use Java 7, and as specified in the original post my creds are correct, because both the aws s3 command line and boto3 CAN read the bucket without any problem. – Azeotrope
@Steve I think I found the issue; I have created a new post for it: #42669746 – Azeotrope
Saw that. I think it's EMR related. Try putting the secrets in the URL as a final experiment (see the escaping sketch after these comments), remembering to escape the + and / symbols, and remembering never to share the secret with anyone else. – Justificatory
I'm not running anything in EMR; this is all local. I have tried the URL one as well, but Spark is still magically generating a random secret_key instead of reading the one I provide. – Azeotrope
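
For the escaping mentioned in the last exchange, here is a sketch of how one might percent-encode placeholder credentials before embedding them in the URL; putting secrets in URLs is insecure and only suitable as a throwaway experiment:

from urllib.parse import quote  # Python 3; urllib.quote on Python 2

# Placeholder credentials; quote(..., safe="") percent-encodes '/' and '+'
# so they survive inside the URL. Never log or share a URL built this way.
access_key = "MY_ACCESS_KEY"
secret_key = "abc/def+ghi"

path = "s3a://{}:{}@mybucket/my/path/*".format(
    quote(access_key, safe=""), quote(secret_key, safe=""))

# spark_context.textFile(path) would then read through the S3A connector.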
