Spark Redshift with Python

I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:

[screenshot of the error message]

My code is as follows:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-    1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "bucket") \
    .load()
Tailorbird answered 11/7, 2016 at 13:0 Comment(0)

Here is a step-by-step process for connecting to Redshift.

  • Download the Redshift JDBC driver jar; try the command below:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
  • Save the code below in a Python file (the .py you want to run) and replace the credentials accordingly:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# Initialize the Spark session
spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

# S3 credentials used by the connector for the temporary directory
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")


taxonomyDf = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3://mybucket/") \
    .load() 
  • Run spark-submit as shown below:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py
Discovery answered 6/1, 2017 at 10:1 Comment(0)

If you are using Spark 2.0.4 and running your code on an AWS EMR cluster, follow the steps below:

1) Download the Redshift JDBC jar using the command below:

wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar

Reference: AWS documentation

2) Copy the code below into a Python file and replace the required values with your AWS resources:

import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# S3 credentials for the temporary directory used by the connector
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "access key")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "secret access key")

# SQLContext expects a SparkContext, not the SparkSession
sqlCon = SQLContext(sc)
df = sqlCon.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
  .option("dbtable", "TABLE_NAME") \
  .option("aws_region", "us-west-1") \
  .option("tempdir", "s3://BUCKET_NAME/PATH/") \
  .mode("error") \
  .save()

3) Run the spark-submit command below:

spark-submit --name "App Name" --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 --py-files python_script.py python_script.py

Notes:

1) The public IP address of the EMR node (on which the spark-submit job will run) should be allowed in the inbound rules of the Redshift cluster's security group.

2) The Redshift cluster and the S3 location used as "tempdir" should be in the same AWS region. In the example above, both resources are in us-west-1.

3) If the data is sensitive, make sure to secure all the channels. To make the connections secure, follow the steps in the connector's configuration documentation (see also the sketch below).
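
Those configuration steps aren't reproduced here, but as a hedged sketch of one common measure, SSL can be enabled on the JDBC connection by appending ssl=true to the Redshift URL (the same pattern appears in another answer on this page). The sketch reuses the df from step 2; the host, database, credentials, and bucket are placeholders:

# Sketch only: enable SSL on the JDBC connection; all connection values are placeholders.
jdbc_url = (
    "jdbc:redshift://HOST_URL:5439/DATABASE_NAME"
    "?user=USERID&password=PASSWORD"
    "&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory"
)

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "TABLE_NAME") \
  .option("aws_region", "us-west-1") \
  .option("tempdir", "s3://BUCKET_NAME/PATH/") \
  .mode("error") \
  .save()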

Dietary answered 20/8, 2019 at 14:34 Comment(0)

The error is due to missing dependencies.

Verify that you have these jar files in the spark home directory:

  1. spark-redshift_2.10-3.0.0-preview1.jar
  2. RedshiftJDBC41-1.1.10.1010.jar
  3. hadoop-aws-2.7.1.jar
  4. aws-java-sdk-1.7.4.jar
  5. aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)

Put these jar files in $SPARK_HOME/jars/ and then start Spark:

pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar

(On macOS with Homebrew, SPARK_HOME is typically "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)

This will run Spark with all the necessary dependencies. Note that you also need to set the option forward_spark_s3_credentials to True if you are authenticating with AWS access keys.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
     .format("com.databricks.spark.redshift") \
     .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-    1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
     .option("dbtable", "table_name") \
     .option('forward_spark_s3_credentials',True) \
     .option("tempdir", "s3n://bucket") \
     .load()

Common errors afterwards are:

  • Redshift Connection Error: "SSL off"
    • Solution: .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
  • S3 Error: When unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
    • Solution: The bucket and the cluster must be in the same region
Flintlock answered 26/11, 2016 at 18:13 Comment(1)
Man, I used the exact list of dependencies listed here and I still get an error: java.lang.IllegalArgumentException: Invalid hostname in URI s3n://somewhere or java.lang.IllegalArgumentException: Invalid hostname in URI s3://somwehere or s3a gives me java.lang.NoClassDefFoundError: com/amazonaws/SdkClientException at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235) :@Debera

If you are using Databricks, I think you don't have to create a new SQLContext, because one is created for you; just use the provided sqlContext. Try this code:

from pyspark.sql import SQLContext

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

df = sqlContext.read \ .......
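
For reference, a rough sketch of what the completed read might look like on Databricks, assuming the spark-redshift connector library is attached to the cluster; the host, database, credentials, table, and bucket are placeholders:

# Sketch only — all connection values are placeholders
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://HOST:5439/DB?user=USER&password=PASSWORD") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3n://YOUR_BUCKET/tmp/") \
    .load()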

Maybe the bucket is not mounted

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
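
As a hedged aside, you can sanity-check the mount afterwards (dbutils is only available on Databricks); MOUNT_NAME is the same placeholder as above:

# Sketch only: list the mounted bucket to confirm the mount worked
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))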
Goode answered 26/7, 2016 at 19:12 Comment(0)

I'm late to this question, but I've spent a ton of time trying to get a local instance of pyspark connected to Amazon Redshift. I am on a Mac, so your configuration might be slightly different. The setup I arrived at assumes the following:

  • Scala 2.12
  • Pyspark 3.3.0
  • openjdk 19.0.2
  • Python 3.10
from pyspark import SparkConf
from pyspark.sql import SparkSession
import urllib.parse

conf = (
    SparkConf()
    .set(
        'spark.jars.packages', 
        # All the dependencies for connecting to redshift
        # and S3
        (
            'org.apache.hadoop:hadoop-aws:3.3.2,'
            'org.apache.spark:spark-avro_2.12:3.3.2,'
            'com.amazonaws:aws-java-sdk-core:1.12.441,'
            'com.amazonaws:aws-java-sdk-redshift:1.12.441,'
            'com.amazonaws:aws-java-sdk-sts:1.12.441,'
            'com.fasterxml.jackson.dataformat:jackson-dataformat-cbor:2.14.2,'
            'com.amazon.redshift:redshift-jdbc42:2.1.0.13,'
            'com.eclipsesource.minimal-json:minimal-json:0.9.4'
        )
    )
    # this leverages the credentials in my
    # ~/.aws/credentials file configured by the CLI
    .set(
        'spark.hadoop.fs.s3a.aws.credentials.provider', 
        'com.amazonaws.auth.DefaultAWSCredentialsProviderChain'
    )
)

spark = (
    SparkSession
    .builder
    .master('local[*]')
    .config(conf=conf)
    .appName('jupyter')
    .getOrCreate()
)

To connect, I needed to URL-encode my credentials (I am working on getting the IAM role working; this uses basic auth):

# My password had special characters
# that needed to be formatted into the url
url = "jdbc:redshift://redshift-cluster.endpoint.region.redshift.amazonaws.com:5439/db?"

params = {'user': user, 'password': pw}
url = url + urllib.parse.urlencode(params)

And finally

df = (
    spark
    .read 
    .format('jdbc')
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    # This uses the s3a credentials mapped in locally
    .option("forward_spark_s3_credentials", True) 
    .option("url", url) 
    .option("dbtable", "schema.table") 
    .option("tempdir", "s3a://my-bucket/tmp/") 
    .load()
)

df.head(1)
Kallick answered 4/4, 2023 at 19:46 Comment(0)

I think the s3n:// URL style has been deprecated and/or removed.

Try defining your keys as "fs.s3.awsAccessKeyId".
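
A minimal sketch of that suggestion, reusing the credential placeholders from the question:

# Sketch only: same credential placeholders as in the question
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEY>")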

Bellay answered 11/7, 2016 at 17:41 Comment(2)
Thank you, I tried to change it but I still have the same error.Tailorbird
s3n is still in use and it is not deprecated; please refer to this link: github.com/databricks/….Dietary

I think that you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify that we should use the spark-redshift connector.
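
Applied to the code in the question, the read would look roughly like this (a sketch; the URL, table, and bucket are the question's placeholders):

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3n://bucket/tmp/") \
    .load()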

For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230

Rebato answered 18/10, 2016 at 21:26 Comment(0)
