Spark is inventing its own AWS secretKey

I'm trying to read an S3 bucket from Spark, and up until today Spark has always complained that the request returns a 403:

hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "ACCESSKEY")
hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
logs = spark_context.textFile("s3a://mybucket/logs/*")

Spark was saying .... Invalid Access key [ACCESSKEY]

However, with the same ACCESSKEY and SECRETKEY this was working with the aws-cli:

aws s3 ls mybucket/logs/

and in Python with boto3 this was working:

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "logs/text.py") \
            .put(Body=open("text.py", "rb"),ContentType="text/x-py")

so my credentials ARE valid and the problem is definitely something with Spark..

Today I decided to turn on the "DEBUG" log for the entire Spark job and to my surprise... Spark is NOT using the [SECRETKEY] I have provided but instead... adds a random one???

17/03/08 10:40:04 DEBUG request: Sending Request: HEAD https://mybucket.s3.amazonaws.com / Headers: (Authorization: AWS ACCESSKEY:[RANDOM-SECRET-KEY], User-Agent: aws-sdk-java/1.7.4 Mac_OS_X/10.11.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.65-b01/1.8.0_65, Date: Wed, 08 Mar 2017 10:40:04 GMT, Content-Type: application/x-www-form-urlencoded; charset=utf-8, )

This is why it still returns 403! Spark is not using the key I provide with fs.s3a.secret.key but instead invents a random one??

For the record, I'm running this locally on my machine (OSX) with this command:

spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py

Could someone enlighten me on this?

Strophanthus answered 8/3, 2017 at 10:49 Comment(2)
Did you ever find an answer to this question that worked?Greenish
Nope, I never found out where these secret keys were getting generated fromConiology

(Updated, as my original answer was downvoted and clearly considered unacceptable.)

The AWS auth protocol doesn't send your secret key over the wire. It uses the secret to sign the request and puts only the resulting signature in the Authorization header. That's why what you see in the debug log isn't what you passed in: it's a signature, not a secret key.
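A minimal sketch of the idea (the AWS ACCESSKEY:<signature> form in the debug log is Signature Version 2; the string-to-sign below is illustrative, not the exact one S3 builds):

import base64
import hashlib
import hmac

# The secret never leaves the machine; only this HMAC-SHA1 signature is
# placed after the access key in "Authorization: AWS ACCESSKEY:<signature>".
def sign_v2(secret_key, string_to_sign):
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("utf-8")

# Illustrative string-to-sign for the HEAD request shown in the question.
string_to_sign = "HEAD\n\n\nWed, 08 Mar 2017 10:40:04 GMT\n/mybucket/"
print("Authorization: AWS ACCESSKEY:" + sign_v2("SECRETKEY", string_to_sign))

Because the date (and the rest of the request) changes on every call, the signature changes too, which also matches the comment below that it was never the same "key" twice.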

For further information, please reread the debug output with that in mind.

Bisector answered 8/3, 2017 at 12:18 Comment(2)
"For the record I'm running this locally on my machine (OSX) with this command" As explained im running this locally with apache-spark binnariesStrophanthus
I've tried to be as helpful as I can as the person who has been working full time on S3A development for 12 months. First, as stated, you are using a different AWS SDK. Fix. Second, Spark propagates env vars. Unset. Then try to get the hadoop fs -ls s3a://mybucket/ command working before looking at the in-cluster setup. Then look at spark-defaults and core-site.xml to see what's happening. Finally, embrace the logs and the debuggers.Bisector

I ran into a similar issue. Requests that were using valid AWS credentials returned a 403 Forbidden, but only on certain machines. Eventually I found out that the system time on those particular machines was 10 minutes behind. Synchronizing the system clock solved the problem.
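A quick sanity check, assuming outbound HTTPS access: compare the local clock against the Date header S3 returns, since AWS rejects requests whose timestamp drifts too far from its own clock.

import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Compare the local clock against the Date header returned by S3.
# A drift of more than a few minutes is enough to get requests rejected.
response = urllib.request.urlopen("https://s3.amazonaws.com", timeout=10)
server_time = parsedate_to_datetime(response.headers["Date"])
skew = abs((datetime.now(timezone.utc) - server_time).total_seconds())
print("clock skew vs S3:", skew, "seconds")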

Hope this helps!

Bubalo answered 29/11, 2017 at 21:58 Comment(0)

This random key is very intriguing. Maybe the AWS SDK is picking up credentials from the OS environment.

In Hadoop 2.8, the default AWS provider chain contains the following providers:

1. BasicAWSCredentialsProvider
2. EnvironmentVariableCredentialsProvider
3. SharedInstanceProfileCredentialsProvider

Order, of course, matters! The AWSCredentialsProviderChain returns the keys from the first provider that supplies them:

if (credentials.getAWSAccessKeyId() != null &&
    credentials.getAWSSecretKey() != null) {
    log.debug("Loading credentials from " + provider.toString());
    lastUsedProvider = provider;
    return credentials;
}

See the code in "GrepCode for AWSCredentialProviderChain".
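If the SDK really is falling back to the environment, a quick check on the submitting machine (these are the variable names EnvironmentVariableCredentialsProvider reads; spark-submit forwards the client's env vars, as noted in the comments below):

import os

# Anything set here gets forwarded by spark-submit and can be picked up
# by EnvironmentVariableCredentialsProvider.
for name in ("AWS_ACCESS_KEY_ID", "AWS_ACCESS_KEY",
             "AWS_SECRET_ACCESS_KEY", "AWS_SECRET_KEY"):
    print(name, "=", os.environ.get(name, "<not set>"))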

I faced a similar problem using profile credentials. The SDK was ignoring the credentials inside ~/.aws/credentials (as good practice, I encourage you not to store credentials inside the program in any way).

My solution...

Set the credentials provider to use ProfileCredentialsProvider

sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com") # yes, I am using central eu server.
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.profile.ProfileCredentialsProvider')
Encarnalize answered 11/5, 2017 at 8:52 Comment(2)
spark-submit will pick up the env vars from the client if set...that could be how things get in. Regarding the fs.s3a.aws.credentials.provider : that's a Hadoop 2.8 feature only: anyone trying to use it with the Hadoop 2.7 JARs will be disappointed. They should try unsetting the env vars before submitting workBisector
For the record, it was a random key every time (never the same)Strophanthus

Folks, go for role-based IAM configuration ... that will open up S3 access via policies that should be added to the EMR default role.
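A minimal sketch of the Spark side of that (assuming Hadoop 2.8+ S3A, which supports fs.s3a.aws.credentials.provider, and a role already attached to the instance):

# With an IAM role attached to the EC2/EMR node, no keys go into the code:
# S3A fetches temporary credentials from the instance metadata service.
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.aws.credentials.provider",
               "com.amazonaws.auth.InstanceProfileCredentialsProvider")
logs = spark_context.textFile("s3a://mybucket/logs/*")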

Abert answered 20/4, 2021 at 17:18 Comment(1)
Good luck doing that locally on a local Spark cluster :)Coniology
