How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

I'm using HiveContext with Spark SQL and I'm trying to connect to a remote Hive metastore. The only way I've found to set the metastore is to include hive-site.xml on the classpath (or copy it to /etc/spark/conf/).

Is there a way to set this parameter programmatically in Java code without including hive-site.xml? If so, which Spark configuration property should I use?

Kira answered 13/8, 2015 at 6:4 Comment(0)

For Spark 1.x, you can set it with:

System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
            
final SparkConf conf = new SparkConf();
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);

Or

final SparkConf conf = new SparkConf();
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
hiveContext.setConf("hive.metastore.uris", "thrift://METASTORE:9083");

Update: if your Hive is Kerberized, try setting these before creating the HiveContext:

System.setProperty("hive.metastore.sasl.enabled", "true");
System.setProperty("hive.security.authorization.enabled", "false");
System.setProperty("hive.metastore.kerberos.principal", hivePrincipal);
System.setProperty("hive.metastore.execute.setugi", "true");
Kira answered 13/8, 2015 at 16:31 Comment(3)
In the case of a Kerberized remote Hive cluster, what additional HiveContext config settings are needed? The above code doesn't work for me in this case and I'm wondering if you have a solution. Thanks.Far
Still doesn't work. My settings: ("hive.metastore.uris","myValue") ("login.user","myValue") ("keytab.file", "myValue") ("sun.security.krb5.debug","false") ("java.security.krb5.conf","myValue") ("java.library.path","myValue") ("hadoop.home.dir","myValue") ("hadoop.security.authentication","kerberos") ("hive.metastore.sasl.enabled", "true") ("hive.security.authorization.enabled", "false") ("hive.metastore.kerberos.principal", "myValue") ("hive.metastore.execute.setugi", "true")Far
It looks to me like my local Spark is not even attempting to connect to the remote Hive, based on what I see in the log.Far

In Spark 2.0+ it should look something like this:

Don't forget to replace "hive.metastore.uris" with yours. This assumes that you already have a Hive metastore service started (not a HiveServer2).
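
If the metastore service is not running yet, it can usually be started on the Hive host with something like the following (assuming a standard Hive installation; 9083 is the default port):

hive --service metastore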

val spark = SparkSession
  .builder()
  .appName("interfacing spark sql to hive metastore without configuration file")
  .config("hive.metastore.uris", "thrift://localhost:9083") // replace with your hivemetastore service's thrift url
  .enableHiveSupport() // don't forget to enable hive support
  .getOrCreate()

import spark.implicits._
import spark.sql
// create an arbitrary frame
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
// see the frame created
frame.show()
/**
 * +-----+-----+
 * | word|count|
 * +-----+-----+
 * |  one|    1|
 * |  two|    2|
 * |three|    3|
 * +-----+-----+
 */
// write the frame
frame.write.mode("overwrite").saveAsTable("t4")
Alpha answered 28/11, 2016 at 16:37 Comment(2)
If I provide configurations using the above config(..) method while also providing hive-site.xml, which one takes precedence? Is there a way to control that?Steamboat
Both. hive-site.xml is used as the base configuration, which is then overridden by Spark properties. My recommendation would be to keep configuration outside the code (in conf/hive-site.xml) so there's no need to recompile the app for configuration changes.Pauwles

I too faced the same problem, but resolved it. Just follow these steps for Spark 2.0:

Step 1: Copy the hive-site.xml file from the Hive conf folder to the Spark conf folder.

Step 2: Edit the spark-env.sh file and configure your MySQL driver (if you are using MySQL as the Hive metastore backend).

Or add the MySQL driver to Maven/SBT (if using those).
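
For example, with sbt the dependency would look something like this (the version is only an example; match it to your MySQL server):

// build.sbt — hypothetical version number, adjust to your environment
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.28"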

Step 3: When creating the Spark session, add enableHiveSupport():

val spark = SparkSession.builder.master("local").appName("testing").enableHiveSupport().getOrCreate()

Sample code:

package sparkSQL

/**
  * Created by venuk on 7/12/16.
  */

import org.apache.spark.sql.SparkSession

object hivetable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("hivetable").enableHiveSupport().getOrCreate()

    spark.sql("create table hivetab (name string, age int, location string) row format delimited fields terminated by ',' stored as textfile")
    spark.sql("load data local inpath '/home/hadoop/Desktop/asl' into table hivetab").show()
    val x = spark.sql("select * from hivetab")
    x.write.saveAsTable("hivetab")
  }
}


Nuthouse answered 7/12, 2016 at 13:7 Comment(4)
Isn't x.write.saveAsTable("hivetab") redundant? You already wrote the values via load data local inpathGalinagalindo
What versions of Hive, Hadoop, and Spark did you use? Can you specify them exactly? I'm trying to connect but facing issues.Nittygritty
Isn't Spark communicating with the Hive metastore via the thrift protocol? Why would it need the MySQL driver?Nuke
I have no idea why spark needs to be able to connect to the hive metastore standalone backend db, but that's what got it to work for me.Eveland

Some similar questions are marked as duplicates. This answer is about connecting to Hive from Spark without using hive.metastore.uris or a separate thrift server (port 9083), and without copying hive-site.xml to SPARK_CONF_DIR; instead, Spark talks directly to the metastore's backing database over JDBC.

import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("hive-check")
  .config(
    "spark.hadoop.javax.jdo.option.ConnectionURL",
    "JDBC_CONNECT_STRING"
  )
  .config(
    "spark.hadoop.javax.jdo.option.ConnectionDriverName",
    "org.postgresql.Driver"
  )
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "JDBC_USER")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "JDBC_PASSWORD")
  .enableHiveSupport()
  .getOrCreate()
spark.catalog.listDatabases.show(false)
Threnode answered 12/5, 2020 at 1:19 Comment(0)

I observed one strange behavior while trying to connect to the Hive metastore from Spark without using hive-site.xml.

Everything works fine when we set the hive.metastore.uris property in Spark code while creating the SparkSession. But if we don't specify it in code and instead pass it to spark-shell or spark-submit with the --conf flag, it will not work.

It throws the warning shown below and does not connect to the remote metastore:

Warning: Ignoring non-Spark config property: hive.metastore.uris

One workaround is to use the property below instead:

spark.hadoop.hive.metastore.uris
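
For example (replace the host with your metastore's):

spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://METASTORE_HOST:9083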
Ephesian answered 23/8, 2021 at 13:15 Comment(0)

Spark Version : 2.0.2

Hive Version : 1.2.1

The Java code below worked for me to connect to the Hive metastore from Spark:

import org.apache.spark.sql.SparkSession;

public class SparkHiveTest {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                  .builder()
                  .appName("Java Spark Hive Example")
                  .config("spark.master", "local")
                  .config("hive.metastore.uris",                
                   "thrift://abc123.com:9083")
                  .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
                  .enableHiveSupport()
                  .getOrCreate();

        spark.sql("SELECT * FROM default.survey_data limit 5").show();
    }
}
Jezebel answered 19/12, 2018 at 10:11 Comment(1)
Hi! If you are trying to revive an old question, why not try an unanswered question instead of one that already has a highly ranked answer? Salutes!Gillen

For Spark 3.x:

// Scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path")
  .enableHiveSupport()
  .getOrCreate()

# Python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path") \
    .enableHiveSupport() \
    .getOrCreate()

Check databases available with:

spark.catalog.listDatabases().show()


Hoskins answered 28/11, 2021 at 12:54 Comment(0)

The code below worked for me. We can skip the hive.metastore.uris config for a local metastore; Spark will create the Hive objects in the local spark-warehouse directory.

import org.apache.spark.sql.SparkSession

object spark_hive_support1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("yarn")
      .appName("Test Hive Support")
      //.config("hive.metastore.uris", "jdbc:mysql://localhost/metastore")
      .enableHiveSupport
      .getOrCreate()

    import spark.implicits._

    val testdf = Seq(("Word1", 1), ("Word4", 4), ("Word8", 8)).toDF
    testdf.show()
    testdf.write.mode("overwrite").saveAsTable("WordCount")
  }
}
Flannelette answered 28/4, 2018 at 17:39 Comment(1)
hive.metastore.uris should be given a thrift:// address, not jdbc:Galinagalindo

In Hadoop 3, Spark and Hive use separate catalogs, so:

For spark-shell (which comes with .enableHiveSupport() by default), just try:

spark-shell --conf spark.hadoop.metastore.catalog.default=hive

For a spark-submit job, create your Spark session like this:

SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()

Then add this conf to your spark-submit command:

--conf spark.hadoop.metastore.catalog.default=hive

But for ORC tables (and, more generally, internal/managed tables) it is recommended to use the Hive Warehouse Connector.
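
A minimal sketch of reading through the Hive Warehouse Connector, assuming the HWC jar is on the classpath and the cluster-specific settings (e.g. spark.sql.hive.hiveserver2.jdbc.url) are already configured; the database and table names are placeholders:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession
val hive = HiveWarehouseSession.session(spark).build()
// Read a Hive-managed (internal/ACID) table through HWC instead of the Spark catalog
hive.executeQuery("SELECT * FROM mydb.my_orc_table").show()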

Holler answered 15/8, 2020 at 15:1 Comment(0)

Setting spark.hadoop.metastore.catalog.default=hive worked for me.
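
For reference, a minimal sketch of setting the same property programmatically on the session builder (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-catalog-default")
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .enableHiveSupport()
  .getOrCreate()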

Swinish answered 17/10, 2020 at 23:54 Comment(0)

I am getting the following error with Spark 2.4.8, Spark 3.1.3, or Spark 3.2.2. The Hadoop version is 3.2, HBase 2.4.14, Hive 3.1.13, and Scala 2.12.

Exception in thread "main" java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)

I am calling spark-submit as follows.

export HBASE_JAR_FILES="/usr/local/hbase/lib/hbase-unsafe-4.1.1.jar,/usr/local/hbase/lib/hbase-common-2.4.14.jar,/usr/local/hbase/lib/hbase-client-2.4.14.jar,/usr/local/hbase/lib/hbase-protocol-2.4.14.jar,/usr/local/hbase/lib/guava-11.0.2.jar,/usr/local/hbase/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar" 


/opt/spark/bin/spark-submit --master local[*] --deploy-mode client \
  --num-executors 1 --executor-cores 1 --executor-memory 480m --driver-memory 512m \
  --driver-class-path $(echo $HBASE_JAR_FILES | tr ',' ':') \
  --jars "$HBASE_JAR_FILES" \
  --files /usr/local/hive/conf/hive-site.xml \
  --conf "spark.hadoop.metastore.catalog.default=hive" \
  --files /usr/local/hbase/conf/hbase-site.xml \
  --class com.hbase.dynamodb.migration.HbaseToDynamoDbSparkMain \
  --conf "spark.driver.maxResultSize=256m" \
  /home/hadoop/scala-2.12/sbt-1.0/HbaseToDynamoDb-assembly-0.1.0-SNAPSHOT.jar

The code is as follows.


val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("Hbase To DynamoDb migration demo")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.catalog.listDatabases().show()

val sqlDF = spark.sql("select rowkey, office_address, office_phone, name, personal_phone from hvcontacts")

sqlDF.show()

The Hive external table was created on top of HBase as follows.

create external table if not exists hvcontacts (rowkey STRING, office_address STRING, office_phone STRING, name STRING, personal_phone STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,Office:Address,Office:Phone,Personal:name,Personal:Phone') TBLPROPERTIES ('hbase.table.name' = 'Contacts');

The metastore is in MySQL and I can query the TBLS table to verify that the external table exists in Hive. Is anyone else facing a similar issue?

NOTE: I am not using the Hive-Spark connector here.

Ragucci answered 23/9, 2022 at 1:57 Comment(2)
This is a known bug, reported at issues.apache.org/jira/browse/HIVE-24706. It seems the functionality is broken in Spark 3.x; it works fine with Spark 2.4.x and Scala 2.11.Ragucci
The issue has been fixed in the pull request at github.com/apache/spark/pull/31147, but it is not in an official release yet.Ragucci
