(python) Spark .textFile(s3://...) access denied 403 with valid credentials

In order to access my S3 bucket I have exported my credentials:

export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESS_KEY_ID=

I can verify that everything works by doing

aws s3 ls mybucket

I can also verify with boto3 that it works in Python:

import boto3

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
        .put(Body=open("text.py", "rb"), ContentType="text/x-py")

This works and I can see the file in the bucket.

However, when I do this with Spark:

from pyspark import SparkContext
from pyspark.sql import SQLContext

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")

I get a nice

> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0"
> encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>

This is how I submit the job locally:

spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py

Why does it work with the command line + boto3, but Spark is choking?

EDIT:

Same issue using s3a:// with

hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

and the same issue using aws-sdk 1.7.4 and hadoop 2.7.2.

Azeotrope answered 7/3, 2017 at 14:47 Comment(5)
Read this? cloudera.com/documentation/enterprise/latest/topics/… – Allix
I think it should also work by exporting AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY. Is creating that credentials file really a necessity? As you can see, Spark correctly picks up the AWS_ACCESS_KEY from the env variable but for some reason fails to authenticate. – Azeotrope
Spark is distributed. Just because you have the ENV variables on one executor doesn't mean the other executors have them as well. You should use SparkConf to set the values (see the sketch just after these comments). – Allix
Getting the same error with hadoopConf = spark_context._jsc.hadoopConfiguration(); hadoopConf.set("fs.s3.awsAccessKeyId", "xxxxx"); hadoopConf.set("fs.s3.awsSecretAccessKey", "xxxxxx"), and the same error when setting them with SparkConf. – Azeotrope
@cricket_007 I think I found the issue; I have created a new post for it: #42669746 – Azeotrope
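
For reference, a minimal sketch of the SparkConf approach suggested in the comments above; the app name and key values are placeholders, not the asker's actual configuration:

from pyspark import SparkConf, SparkContext

# Placeholder credentials: the spark.hadoop.* prefix tells Spark to copy
# these options into the Hadoop configuration, so the S3A connector sees
# them on the driver and on every executor.
conf = (SparkConf()
        .setAppName("s3a-credentials-sketch")
        .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
        .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY"))

spark_context = SparkContext(conf=conf)
lines = spark_context.textFile("s3a://mybucket/my/path/*")
print(lines.count())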

Spark will automatically copy your AWS credentials to the s3n and s3a secrets. Apache Spark releases don't touch s3:// URLs, because in Apache Hadoop the s3:// schema is associated with the original, now-deprecated s3 client, one which is incompatible with everything else.

On Amazon EMR, s3:// is bound to Amazon's own EMR S3 client, and the EC2 VMs provide the secrets to the executors automatically, so I don't think it bothers with the env-var propagation mechanism. It might also be that, given how it sets up the authentication chain, you can't override the EC2/IAM data.

If you are trying to talk to S3 and you are not running in an EMR VM, then presumably you are using Apache Spark with the Apache Hadoop JARs, not the EMR versions. In that world, use URLs with s3a:// to get the latest S3 client library.
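
As a rough sketch of what that looks like from PySpark, reusing the bucket path from the question and assuming hadoop-aws is on the classpath (the app name is a placeholder):

from pyspark import SparkContext

# Submitted, for example, with:
#   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 test.py
# and relying on Spark copying the exported AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY into the fs.s3a.* secrets, as described above.
spark_context = SparkContext(appName="s3a-read-sketch")
lines = spark_context.textFile("s3a://mybucket/my/path/*")
print(lines.take(5))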

If that doesn't work, look at the troubleshooting section of the Apache docs. There's a section on "403" there, including recommended steps for troubleshooting. It can be due to classpath/JVM version problems as well as credential issues, or even clock skew between the client and AWS.

Justificatory answered 7/3, 2017 at 16:6 Comment(7)
Sorry, I forgot to mention it in the original post, but I'm not running this on EMR (it works there); I'm running it locally. I will try with s3a:// – Azeotrope
Same thing, but the error is slightly different: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXX, AWS Error Code: null, AWS Error Message: Forbidden – Azeotrope
OK, well, S3A is the ASF code. As well as an auth problem, it can suffer from an incompatibility between joda-time and the JVM version. I'll edit my answer with a link to the relevant docs. – Justificatory
Thanks for the link. My clock is at the right time, I use Java 7, and as specified in the original post my creds are correct, because both the aws s3 command line and boto3 CAN read the bucket without any problem. – Azeotrope
@Steve I think I found the issue; I have created a new post for it: #42669746 – Azeotrope
Saw that. I think it's EMR related. Try putting the secrets in the URL as a final experiment (see the escaping sketch after these comments), remembering to escape the + and / symbols, and remembering never to share the secret with anyone else. – Justificatory
I'm not running anything in EMR; this is all local. I have tried the URL one as well, but Spark is still magically generating a random secret_key instead of reading the one I provide. – Azeotrope
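
For the escaping mentioned in the last exchange, here is a sketch of how one might percent-encode placeholder credentials before embedding them in the URL; putting secrets in URLs is insecure and only suitable as a throwaway experiment:

from urllib.parse import quote  # Python 3; urllib.quote on Python 2

# Placeholder credentials; quote(..., safe="") percent-encodes '/' and '+'
# so they survive inside the URL. Never log or share a URL built this way.
access_key = "MY_ACCESS_KEY"
secret_key = "abc/def+ghi"

path = "s3a://{}:{}@mybucket/my/path/*".format(
    quote(access_key, safe=""), quote(secret_key, safe=""))

# spark_context.textFile(path) would then read through the S3A connector.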
