How to specify the AWS Access Key ID and Secret Access Key as part of an Amazon s3n URL

I am passing input and output folders as parameters to a MapReduce word count program from a web page.

I am getting the following error:

HTTP Status 500 - Request processing failed; nested exception is java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

Ankledeep answered 24/7, 2014 at 3:48 Comment(0)

The documentation (http://wiki.apache.org/hadoop/AmazonS3) gives the format:

 s3n://ID:SECRET@BUCKET/Path
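
For example, a minimal PySpark sketch using this URL form (the key ID, secret, and bucket are placeholders, and the s3n filesystem classes are assumed to be on the classpath; the secret is URL-encoded in case it contains characters such as "/"):

from urllib.parse import quote
from pyspark.sql import SparkSession

# Placeholders -- substitute your own values. URL-encoding keeps characters
# such as "/" in the secret from breaking the URI.
ACCESS_KEY = quote("<AWS_ACCESS_KEY_ID>", safe="")
SECRET_KEY = quote("<AWS_SECRET_ACCESS_KEY>", safe="")

spark = SparkSession.builder.appName("s3n-url-demo").getOrCreate()

# Read a file straight from S3 with the credentials embedded in the s3n URL.
lines = spark.sparkContext.textFile(
    "s3n://{}:{}@<bucket-name>/path/to/input".format(ACCESS_KEY, SECRET_KEY)
)
print(lines.count())
spark.stop()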
Ilka answered 24/7, 2014 at 12:32 Comment(3)
Unfortunately this does not work if the secret happens to contain a "/", which is quite frequent. It's an old known bug issues.apache.org/jira/browse/HADOOP-3733, and may be fixed in Hadoop 2.8 for the s3a protocol: issues.apache.org/jira/browse/HADOOP-11573. The alternative is to put the keys in the conf (but this has other caveats too)Deauville
It worked for emr-4.3.0. Emr-4.4.0 and emr-4.5.0 throw java.lang.IllegalArgumentException: Bucket name must not be formatted as an IP Address, as if the ID and the SECRET were part of the bucket name. Emr-4.6.0 throws java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long. Any ideas?Hypostatize
s3n is not supported anymoreRemedial

I suggest you use this:

hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=<your_access_id> \
  -Dfs.s3n.awsSecretAccessKey=<your_access_key> \
  s3n://<origin> hdfs://<destination>

It also works as a workaround when the secret key contains slashes. The -D parameters with the key ID and secret key must be supplied in exactly this order: after distcp and before the origin.

Alarcon answered 7/3, 2016 at 15:9 Comment(1)
s3n is not supported anymoreRemedial

Passing the AWS credentials as part of the Amazon s3n URL is normally not recommended, security-wise, especially if that code is pushed to a repository hosting service (like GitHub). Ideally, set your credentials in conf/core-site.xml as:

<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>XXXXXX</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>XXXXXX</value>
  </property>
</configuration>

or (re)install the AWS CLI on your machine:

pip install awscli
Vaclav answered 18/5, 2016 at 18:4 Comment(5)
Where to add the <configuration> data? My pom.xml doesn't seem to like it. I'm running a Spark job on a CentOS VM, and installing and configuring AWS CLI also didn't help.Anywhere
add it in this file: conf/core-site.xmlVaclav
What and where is this conf/core-site.xml?Anywhere
what if there are different s3 accounts requiring different keys?Pleiad
@prometheus2305 Unfortunately I was not able to solve that problem.Vaclav

For PySpark beginners:

Prepare

Download the hadoop-aws jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws and put it in the Spark jars folder (or pull it in with spark.jars.packages, as in the sketch below).
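
As an alternative to copying the jar by hand, a minimal sketch that lets Spark resolve hadoop-aws from Maven via spark.jars.packages (the version string below is an assumption; match it to your Hadoop build):

from pyspark.sql import SparkSession

# Sketch: pull hadoop-aws at startup instead of placing the jar manually.
# The artifact version is a placeholder -- use the one matching your Hadoop build.
spark = (
    SparkSession.builder
    .appName("s3-demo")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .getOrCreate()
)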

Then you can use either of the following:

1. Hadoop config file

Set the credentials as environment variables (for example in your shell or in spark-env.sh):

export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>

and declare the S3 filesystem implementations in core-site.xml:

<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>

  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
  </property>
</configuration>

2. pyspark config

sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")

Example

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf


if __name__ == "__main__":
    """
        Usage: S3 sample
    """
    access_key = '<access-key>'
    secret_key = '<secret-key>'

    spark = SparkSession\
        .builder\
        .appName("Demo")\
        .getOrCreate()

    sc = spark.sparkContext

    # remove this block if you use core-site.xml and environment variables instead
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")

    # fetch from s3, returns RDD
    csv_rdd = spark.sparkContext.textFile("s3n://<bucket-name>/path/to/file.csv")
    c = csv_rdd.count()
    print("~~~~~~~~~~~~~~~~~~~~~count~~~~~~~~~~~~~~~~~~~~~")
    print(c)

    spark.stop()
Badly answered 1/2, 2019 at 7:11 Comment(0)

Create a file core-site.xml and put it on the classpath. In the file, specify:

<configuration>
    <property>
        <name>fs.s3.awsAccessKeyId</name>
        <value>your aws access key id</value>
        <description>
            aws s3 key id
        </description>
    </property>

    <property>
        <name>fs.s3.awsSecretAccessKey</name>
        <value>your aws access key</value>
        <description>
            aws s3 key
        </description>
    </property>
</configuration>

Hadoop by default specifies two resources, loaded in order from the classpath:

  • core-default.xml: Read-only defaults for Hadoop
  • core-site.xml: Site-specific configuration for a given Hadoop installation
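
If you are on PySpark (as in the earlier answer), a quick sanity-check sketch to confirm the file was picked up from the classpath (the property name matches the one set above):

from pyspark.sql import SparkSession

# Sketch: verify that core-site.xml on the classpath was actually loaded.
spark = SparkSession.builder.appName("conf-check").getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()
print(conf.get("fs.s3.awsAccessKeyId"))  # should print the value from core-site.xml
spark.stop()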
Wingate answered 7/11, 2019 at 20:33 Comment(0)

Change s3 to s3n in the S3 URI (for example, s3://my-bucket/path becomes s3n://my-bucket/path), so that the fs.s3n credential properties apply.

Knighterrant answered 19/7, 2021 at 16:49 Comment(1)
Please try to give proper explanation of the answer.Trula
hadoop distcp \
  -Dfs.s3a.access.key=<....> \
  -Dfs.s3a.secret.key=<....> \
  -Dfs.s3a.fast.upload=true \
  -update \
  s3a://path to file/ hdfs:///path/
Mediocrity answered 20/11, 2021 at 13:0 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Provo

s3n://ID:SECRET@BUCKET/Path

Special characters like / can be escaped using the URLEncoder.encode method, e.g.:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
String ID = URLEncoder.encode("<AWS_ACCESS_KEY_ID>", StandardCharsets.UTF_8.toString());
String SECRET = URLEncoder.encode("<AWS_SECRET_ACCESS_KEY>", StandardCharsets.UTF_8.toString());
Toxophilite answered 10/1, 2024 at 16:55 Comment(0)
