Spark Catalog w/ AWS Glue: database not found
I've created an EMR cluster that uses the Glue Data Catalog. When I invoke spark-shell, I can successfully list the tables stored in a Glue database via:

spark.catalog.setCurrentDatabase("test")
spark.catalog.listTables

However, when I submit a job via spark-submit, it fails with a fatal error:

ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.;

The job submitted via spark-submit creates its SparkSession with:

SparkSession.builder.enableHiveSupport.getOrCreate
Dewain answered 19/9, 2017 at 3:29 Comment(0)
S
17

Adding the hive.metastore.client.factory.class configuration to the code initiating the spark session solved the issue for me:

SparkSession spark = SparkSession.builder()
            ...
            .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
            .enableHiveSupport()
            .getOrCreate();

That's the same configuration described in the AWS docs (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html) and added to the cluster configuration when you check "Use for Hive table metadata" at cluster creation, but for some reason it doesn't work as expected (I'm using EMR 5.12.0).
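To see whether the cluster-level setting actually reached Spark, one option is to read the hive-site.xml that Spark picks up and look for the property. The sketch below is stdlib-only Java and rests on assumptions: the class name HiveSiteProbe is made up for illustration, and /etc/spark/conf/hive-site.xml is only the location I would expect on an EMR node, so pass the real path as an argument if yours differs.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Spark or EMR): looks up one property
// in a Hadoop-style XML config file such as hive-site.xml.
final class HiveSiteProbe {
    static final String FACTORY_KEY = "hive.metastore.client.factory.class";

    /** Returns the value of the named property, or empty if it is not defined. */
    static Optional<String> readProperty(InputStream xml, String key) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                if (key.equals(childText(prop, "name"))) {
                    return Optional.ofNullable(childText(prop, "value"));
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config XML", e);
        }
    }

    /** Trimmed text of the first child element with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; override with the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "/etc/spark/conf/hive-site.xml");
        try (InputStream in = Files.newInputStream(path)) {
            String value = readProperty(in, FACTORY_KEY).orElse("(not set)");
            System.out.println(FACTORY_KEY + " = " + value);
        }
    }
}
```

If the property is missing or points at a different class, setting it explicitly in the SparkSession builder, as above, is a reasonable workaround.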

Sanctimonious answered 13/3, 2018 at 13:50 Comment(2)
Exactly, this config solved the issue. It is the same as selecting that metadata checkbox in the EMR creation with EC2. Thanks a lot. (Duo)
Perfect answer for me. (Therron)
I
5

I had the same issue: spark-submit would not discover the AWS Glue libraries, but spark-shell running on the master node would.

It turned out that my spark-submit job used a fat .jar that was compiled with the standard org.apache.spark and org.apache.hive libraries. The libraries bundled in the jar were being used instead of the custom classes installed on EMR. If this is the case for you, make sure to exclude all of the following modules from your .jar:

'org.apache.spark:' 'org.apache.hive:' 'org.apache.hadoop:'

Here is the reference I used for Gradle: http://unethicalblogger.com/2015/07/15/gradle-goodness-excluding-depends-from-shadow.html

Declaring all Spark libraries as compileOnly dependencies fixed it.
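To check whether the fat jar is shadowing the cluster's classes, you can print where each relevant class is loaded from at the start of the job. This is a minimal sketch using only the JDK; ClasspathProbe is a hypothetical helper name, and the class names in main are simply the ones relevant to this thread.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports which jar (or directory) a class comes from.
final class ClasspathProbe {

    /** Returns the location the named class was loaded from, or a marker if it cannot be loaded. */
    static String locate(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(JDK / bootstrap class loader)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND (" + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.spark.sql.SparkSession",
            "org.apache.hadoop.hive.conf.HiveConf",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the Spark or Hive classes resolve to your own application jar rather than to the jars installed on the cluster, the bundled copies are likely the ones being used, which matches the symptom described above.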

Inessive answered 11/10, 2017 at 21:23 Comment(0)
D
1

Our issue was IAM permissions on the EMR cluster; make sure that the cluster's IAM instance profile has full access to Glue.

Dewain answered 12/10, 2017 at 18:57 Comment(0)
C
1

You should check the option "Use Glue data catalog as the Hive metastore" inside the Glue job. This is essential: otherwise Spark won't see the Glue catalog and will only see the "default" database created by Glue.

Casady answered 21/3, 2023 at 14:55 Comment(0)
R
0

My problem ended up being that another configuration classification had been interfering with the spark-hive-site one. I deleted all the others, and the job was finally able to connect.

Raul answered 14/4, 2022 at 21:44 Comment(0)
H
-2

EMR 5.9.0 has just been released - please give it a shot, it should work for you.

Relevant documentation:

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html

Hebrew answered 6/10, 2017 at 4:56 Comment(0)
