How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

I'm using HiveContext with Spark SQL and I'm trying to connect to a remote Hive metastore. The only way I've found to set the metastore is to include hive-site.xml on the classpath (or copy it to /etc/spark/conf/).

Is there a way to set this parameter programmatically in Java code without including hive-site.xml? If so, which Spark configuration property should I use?

Kira answered 13/8, 2015 at 6:4 Comment(0)

For Spark 1.x, you can set it with:

System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
            
final SparkConf conf = new SparkConf();
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);

Or

final SparkConf conf = new SparkConf();
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
hiveContext.setConf("hive.metastore.uris", "thrift://METASTORE:9083");

Update: if your Hive is Kerberized, try setting these before creating the HiveContext:

System.setProperty("hive.metastore.sasl.enabled", "true");
System.setProperty("hive.security.authorization.enabled", "false");
System.setProperty("hive.metastore.kerberos.principal", hivePrincipal);
System.setProperty("hive.metastore.execute.setugi", "true");
Kira answered 13/8, 2015 at 16:31 Comment(3)
In the case of a Kerberized remote Hive cluster, what additional HiveContext config settings are needed? The above code doesn't work for me in this case and I'm wondering if you have a solution. Thanks.Far
Still doesn't work. My settings: ("hive.metastore.uris","myValue") ("login.user","myValue") ("keytab.file", "myValue") ("sun.security.krb5.debug","false") ("java.security.krb5.conf","myValue") ("java.library.path","myValue") ("hadoop.home.dir","myValue") ("hadoop.security.authentication","kerberos") ("hive.metastore.sasl.enabled", "true") ("hive.security.authorization.enabled", "false") ("hive.metastore.kerberos.principal", "myValue") ("hive.metastore.execute.setugi", "true")Far
It looks to me like my local Spark is not even attempting to connect to the remote Hive, based on what I see in the log.Far

In Spark 2.0+ it should look something like this:

Don't forget to replace "hive.metastore.uris" with yours. This assumes that you already have a Hive metastore service started (not a HiveServer2).
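
If the metastore service is not running yet, it can usually be started on the Hive host with something like the following (assuming a standard Hive installation; 9083 is the default port):

hive --service metastore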

val spark = SparkSession
  .builder()
  .appName("interfacing spark sql to hive metastore without configuration file")
  .config("hive.metastore.uris", "thrift://localhost:9083") // replace with your hivemetastore service's thrift url
  .enableHiveSupport() // don't forget to enable hive support
  .getOrCreate()

import spark.implicits._
import spark.sql
// create an arbitrary frame
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
// see the frame created
frame.show()
/**
 * +-----+-----+
 * | word|count|
 * +-----+-----+
 * |  one|    1|
 * |  two|    2|
 * |three|    3|
 * +-----+-----+
 */
// write the frame
frame.write.mode("overwrite").saveAsTable("t4")
Alpha answered 28/11, 2016 at 16:37 Comment(2)
If I provide configurations using the above config(..) method while also providing hive-site.xml, which one takes precedence? Is there a way to control that?Steamboat
Both. hive-site.xml is used as the base configuration, which is then overridden by Spark properties. My recommendation would be to keep configuration outside the code (in conf/hive-site.xml) so there's no need to recompile the app for configuration changes.Pauwles

I too faced the same problem, but resolved it. Just follow these steps for Spark 2.0:

Step 1: Copy the hive-site.xml file from the Hive conf folder to the Spark conf folder.

Step 2: Edit the spark-env.sh file and configure your MySQL driver (if you are using MySQL as the Hive metastore backend).

Or add the MySQL driver to Maven/SBT (if using those).
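
For example, with sbt the dependency would look something like this (the version is only an example; match it to your MySQL server):

// build.sbt — hypothetical version number, adjust to your environment
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.28"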

Step 3: When creating the Spark session, add enableHiveSupport():

val spark = SparkSession.builder.master("local").appName("testing").enableHiveSupport().getOrCreate()

Sample code:

package sparkSQL

/**
  * Created by venuk on 7/12/16.
  */

import org.apache.spark.sql.SparkSession

object hivetable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("hivetable").enableHiveSupport().getOrCreate()

    spark.sql("create table hivetab (name string, age int, location string) row format delimited fields terminated by ',' stored as textfile")
    spark.sql("load data local inpath '/home/hadoop/Desktop/asl' into table hivetab").show()
    val x = spark.sql("select * from hivetab")
    x.write.saveAsTable("hivetab")
  }
}


Nuthouse answered 7/12, 2016 at 13:7 Comment(4)
Isn't x.write.saveAsTable("hivetab") redundant? You already wrote the values via load data local inpathGalinagalindo
What versions of Hive, Hadoop, and Spark did you use? Can you specify them exactly? I'm trying to connect but facing issues.Nittygritty
Isn't Spark communicating with the Hive metastore via the thrift protocol? Why would it need the MySQL driver?Nuke
I have no idea why spark needs to be able to connect to the hive metastore standalone backend db, but that's what got it to work for me.Eveland

Some similar questions are marked as duplicates. This answer is about connecting to Hive from Spark without using hive.metastore.uris or a separate thrift server (port 9083), and without copying hive-site.xml to SPARK_CONF_DIR; instead, Spark talks directly to the metastore's backing database over JDBC.

import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("hive-check")
  .config(
    "spark.hadoop.javax.jdo.option.ConnectionURL",
    "JDBC_CONNECT_STRING"
  )
  .config(
    "spark.hadoop.javax.jdo.option.ConnectionDriverName",
    "org.postgresql.Driver"
  )
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "JDBC_USER")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "JDBC_PASSWORD")
  .enableHiveSupport()
  .getOrCreate()
spark.catalog.listDatabases.show(false)
Threnode answered 12/5, 2020 at 1:19 Comment(0)

I observed one strange behavior while trying to connect to the Hive metastore from Spark without using hive-site.xml.

Everything works fine when we set the hive.metastore.uris property in Spark code while creating the SparkSession. But if we don't specify it in code and instead pass it to spark-shell or spark-submit with the --conf flag, it will not work.

It throws the warning shown below and does not connect to the remote metastore:

Warning: Ignoring non-Spark config property: hive.metastore.uris

One workaround is to use the property below instead:

spark.hadoop.hive.metastore.uris
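
For example (replace the host with your metastore's):

spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://METASTORE_HOST:9083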
Ephesian answered 23/8, 2021 at 13:15 Comment(0)

Spark Version : 2.0.2

Hive Version : 1.2.1

The Java code below worked for me to connect to the Hive metastore from Spark:

import org.apache.spark.sql.SparkSession;

public class SparkHiveTest {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                  .builder()
                  .appName("Java Spark Hive Example")
                  .config("spark.master", "local")
                  .config("hive.metastore.uris",                
                   "thrift://abc123.com:9083")
                  .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
                  .enableHiveSupport()
                  .getOrCreate();

        spark.sql("SELECT * FROM default.survey_data limit 5").show();
    }
}
Jezebel answered 19/12, 2018 at 10:11 Comment(1)
Hi! If you are trying to revive an old question, why not try an unanswered question instead of one that already has a highly ranked answer? Salutes!Gillen

For Spark 3.x:

// Scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path")
  .enableHiveSupport()
  .getOrCreate()

# Python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path") \
    .enableHiveSupport() \
    .getOrCreate()

Check databases available with:

spark.catalog.listDatabases().show()


Hoskins answered 28/11, 2021 at 12:54 Comment(0)

The code below worked for me. We can skip the hive.metastore.uris config for a local metastore; Spark will create the Hive objects in the local spark-warehouse directory.

import org.apache.spark.sql.SparkSession

object spark_hive_support1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("yarn")
      .appName("Test Hive Support")
      //.config("hive.metastore.uris", "jdbc:mysql://localhost/metastore")
      .enableHiveSupport
      .getOrCreate()

    import spark.implicits._

    val testdf = Seq(("Word1", 1), ("Word4", 4), ("Word8", 8)).toDF
    testdf.show()
    testdf.write.mode("overwrite").saveAsTable("WordCount")
  }
}
Flannelette answered 28/4, 2018 at 17:39 Comment(1)
hive.metastore.uris should be given a thrift:// address, not jdbc:Galinagalindo

In Hadoop 3, Spark and Hive use separate catalogs, so:

For spark-shell (which comes with .enableHiveSupport() by default), just try:

spark-shell --conf spark.hadoop.metastore.catalog.default=hive

For a spark-submit job, create your Spark session like this:

SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()

Then add this conf to your spark-submit command:

--conf spark.hadoop.metastore.catalog.default=hive

But for ORC tables (and, more generally, internal/managed tables) it is recommended to use the Hive Warehouse Connector.
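
A minimal sketch of reading through the Hive Warehouse Connector, assuming the HWC jar is on the classpath and the cluster-specific settings (e.g. spark.sql.hive.hiveserver2.jdbc.url) are already configured; the database and table names are placeholders:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession
val hive = HiveWarehouseSession.session(spark).build()
// Read a Hive-managed (internal/ACID) table through HWC instead of the Spark catalog
hive.executeQuery("SELECT * FROM mydb.my_orc_table").show()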

Holler answered 15/8, 2020 at 15:1 Comment(0)

Setting spark.hadoop.metastore.catalog.default=hive worked for me.
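
For reference, a minimal sketch of setting the same property programmatically on the session builder (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-catalog-default")
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .enableHiveSupport()
  .getOrCreate()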

Swinish answered 17/10, 2020 at 23:54 Comment(0)

I am getting the following error with Spark 2.4.8, Spark 3.1.3, or Spark 3.2.2. The Hadoop version is 3.2, HBase 2.4.14, Hive 3.1.13, and Scala 2.12.

Exception in thread "main" java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)

I am calling spark-submit as follows.

export HBASE_JAR_FILES="/usr/local/hbase/lib/hbase-unsafe-4.1.1.jar,/usr/local/hbase/lib/hbase-common-2.4.14.jar,/usr/local/hbase/lib/hbase-client-2.4.14.jar,/usr/local/hbase/lib/hbase-protocol-2.4.14.jar,/usr/local/hbase/lib/guava-11.0.2.jar,/usr/local/hbase/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar" 


/opt/spark/bin/spark-submit --master local[*] --deploy-mode client \
  --num-executors 1 --executor-cores 1 --executor-memory 480m --driver-memory 512m \
  --driver-class-path $(echo $HBASE_JAR_FILES | tr ',' ':') \
  --jars "$HBASE_JAR_FILES" \
  --files /usr/local/hive/conf/hive-site.xml \
  --conf "spark.hadoop.metastore.catalog.default=hive" \
  --files /usr/local/hbase/conf/hbase-site.xml \
  --class com.hbase.dynamodb.migration.HbaseToDynamoDbSparkMain \
  --conf "spark.driver.maxResultSize=256m" \
  /home/hadoop/scala-2.12/sbt-1.0/HbaseToDynamoDb-assembly-0.1.0-SNAPSHOT.jar

The code is as follows.


val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("Hbase To DynamoDb migration demo")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.catalog.listDatabases().show()

val sqlDF = spark.sql("select rowkey, office_address, office_phone, name, personal_phone from hvcontacts")

sqlDF.show()

The Hive external table was created on top of HBase as follows.

create external table if not exists hvcontacts (rowkey STRING, office_address STRING, office_phone STRING, name STRING, personal_phone STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,Office:Address,Office:Phone,Personal:name,Personal:Phone') TBLPROPERTIES ('hbase.table.name' = 'Contacts');

The metastore is in MySQL and I can query the TBLS table to verify that the external table exists in Hive. Is anyone else facing a similar issue?

NOTE: I am not using the Hive-Spark connector here.

Ragucci answered 23/9, 2022 at 1:57 Comment(2)
This is a known bug, reported at issues.apache.org/jira/browse/HIVE-24706. It seems the functionality is broken in Spark 3.x; it works fine with Spark 2.4.x and Scala 2.11.Ragucci
The issue has been fixed in the pull request at github.com/apache/spark/pull/31147, but it is not in an official release yet.Ragucci
