How to set hive.metastore.warehouse.dir in HiveContext?

I'm trying to write a unit test case that relies on DataFrame.saveAsTable() (since it is backed by a file system). I point the hive warehouse parameter to a local disk location:

sql.sql(s"SET hive.metastore.warehouse.dir=file:///home/myusername/hive/warehouse")

By default, the metastore should run in embedded mode, which doesn't require an external database.

But HiveContext seems to ignore this configuration, since I still get this error when calling saveAsTable():

MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:172)
    at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:224)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
    at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
    at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1121)
    at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1071)
    at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1037)

This is quite annoying. Why is it still happening, and how can I fix it?

Aerography answered 28/5, 2015 at 22:30 Comment(3)
javax.jdo.option.ConnectionURL doesn't help either. It seems to be too late once the context has already been instantiated.Poteen
Wondering if you've ever solved this issue - I'm having the same problem.Decile
Same problem here (Spark 1.6.1). I tried setting it with hive-site.xml and it seems to be ignored (though it does parse the file, since it fails to launch if there is an XML syntax error).Colwen

According to http://spark.apache.org/docs/latest/sql-programming-guide.html#sql

Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
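
For Spark 2.0.0 and later, here is a minimal sketch of setting that at session creation (the app name, master, and path are placeholders, not anything from the question):

    import org.apache.spark.sql.SparkSession

    // The warehouse location must be set before the session is created;
    // changing it afterwards with SET has no effect on the metastore.
    val spark = SparkSession.builder()
      .appName("WarehouseDirExample")                                   // placeholder
      .master("local[*]")                                               // placeholder
      .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") // example path
      .getOrCreate()

    spark.range(10).write.saveAsTable("numbers") // lands under the configured warehouse dir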

Logorrhea answered 10/10, 2016 at 5:16 Comment(0)

tl;dr Set hive.metastore.warehouse.dir while creating a SQLContext (or SparkSession).

The default location of the Hive metastore warehouse database is /user/hive/warehouse. It used to be set with the Hive-specific configuration property hive.metastore.warehouse.dir (in a Hadoop configuration).

It's been a while since you asked this question (it's Spark 2.3 days now), but that part has not changed since: by the time you use the sql method of SQLContext (or SparkSession these days), it's simply too late to change where Spark creates the metastore database, because the underlying infrastructure has already been set up (that's what lets you use the SQLContext in the first place). The warehouse location has to be set before the HiveContext / SQLContext / SparkSession is initialized.

You should set hive.metastore.warehouse.dir while creating the SparkSession (or SQLContext before Spark SQL 2.0) using config and, very importantly, enable Hive support using enableHiveSupport.

config(key: String, value: String): Builder Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.

enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
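
Putting the two together, a minimal sketch (the app name, master, and warehouse path are placeholders):

    import org.apache.spark.sql.SparkSession

    // Both the config call and enableHiveSupport must happen before
    // getOrCreate; once the session exists, the warehouse location is fixed.
    val spark = SparkSession.builder()
      .appName("HiveWarehouseExample") // placeholder
      .master("local[*]")              // placeholder
      .config("hive.metastore.warehouse.dir", "file:///home/myusername/hive/warehouse") // example path
      .enableHiveSupport()
      .getOrCreate()

    spark.range(10).write.saveAsTable("users") // created under the configured warehouse

Note that on Spark 2.0+ the property is deprecated and spark.sql.warehouse.dir takes precedence, as the other answer points out.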

You could also use a hive-site.xml configuration file or the spark.hadoop prefix, but I'm digressing (and it strongly depends on the current configuration).

Oliy answered 6/2, 2018 at 8:12 Comment(4)
I have set hive.metastore.warehouse.dir to a remote Hive warehouse hdfs://xx.xx.xx:8020/user/hive/warehouse and then enabled enableHiveSupport(), but it's still unable to read the tables from Hive. Do I have to change the XML file by adding some properties to it as well?Bradawl
What's the Spark version? How did you set the configuration?Oliy
I am using Spark 2.3.2, setting the configuration like below: val spark = SparkSession.builder() .appName("ApplicationName") .master("yarn") .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .config("spark.executor.memory", "48120M") .config("hive.metastore.warehouse.dir", "hdfs://ip-10-129-224-21.eu-west-1.compute.internal:8020/user/hive/warehouse") .enableHiveSupport() .getOrCreate()Bradawl
Can you set the option on command line when you spark-submit?Oliy

Another option is to just create a new database, USE new_database, and then create the table. The warehouse will be created under the directory from which you ran the Spark SQL job; a minimal sketch follows below.
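
A sketch of that approach, assuming Spark 2.x with Hive support (the database and table names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("NewDatabaseExample") // placeholder
      .enableHiveSupport()
      .getOrCreate()

    // Create and switch to a fresh database; its files land under the
    // warehouse directory resolved relative to where the job was launched.
    spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
    spark.sql("USE mydb")
    spark.range(10).write.saveAsTable("users") // stored under .../mydb.db/users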

Devastation answered 8/8, 2018 at 19:52 Comment(0)

I faced exactly the same issue. I was running the spark-submit command in a shell action via Oozie.

Setting the warehouse directory while creating the SparkSession didn't work for me.

All you need to do is pass hive-site.xml in the spark-submit command using the property below:

--files ${location_of_hive-site.xml}
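
For example, a sketch of the full command (the class, jar, and file location are placeholders):

    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --files /etc/hive/conf/hive-site.xml \
      my-app.jar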

Hypotaxis answered 28/6, 2019 at 7:44 Comment(0)
