In order to access my S3 bucket, I have exported my credentials:
export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESS_KEY_ID=
I can verify that everything works by running:
aws s3 ls mybucket
I can also verify from Python, with boto3, that it works:

import boto3

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
    .put(Body=open("text.py", "rb"), ContentType="text/x-py")
This works and I can see the file in the bucket.
However, when I do the same with Spark:

from pyspark import SparkContext
from pyspark.sql import SQLContext

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")
I get a nice
> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0"
> encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>
This is how I submit the job locally:
spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py
Why does it work with the command line and boto3, but Spark is choking?
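My guess is that the plain s3:// scheme (backed by jets3t, going by the stack trace) never looks at the AWS_* environment variables and instead wants its own Hadoop properties. If so, I would expect the equivalent of my exports to look roughly like this (a sketch; property names assumed from the Hadoop 2.x fs.s3 configuration, not verified):

# Assumed jets3t-style property names for the legacy s3:// scheme;
# the AWS_* environment variables would not be read by this connector.
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.awsAccessKeyId", "xxxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxxxxxx")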
EDIT:
Same issue using s3a:// with
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
and the same issue when using aws-sdk 1.7.4 with hadoop 2.7.2, and when using SparkConf to set the values.
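For reference, this is roughly what I mean by setting the values through SparkConf (a sketch; it relies on Spark copying spark.hadoop.* properties into the Hadoop configuration at startup):

from pyspark import SparkConf, SparkContext

# spark.hadoop.* entries are forwarded into the Hadoop configuration,
# so this should be equivalent to the hadoopConf.set(...) calls above.
conf = SparkConf() \
    .set("spark.hadoop.fs.s3a.access.key", "xxxx") \
    .set("spark.hadoop.fs.s3a.secret.key", "xxxxxxx")
spark_context = SparkContext(conf=conf)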