Start HiveThriftServer programmatically in Python
S

2

3

In spark-shell (Scala), we import org.apache.spark.sql.hive.thriftserver._ and start the Hive Thrift server programmatically for a particular Hive context with HiveThriftServer2.startWithContext(hiveContext). This exposes the temp tables registered in that particular session.

How can we do the same in Python? Is there a Python package or API for importing HiveThriftServer? Any other thoughts or recommendations are appreciated.

We have used PySpark to create a DataFrame.

Thanks

Ravi Narayanan

Sweptwing answered 14/4, 2016 at 16:32 Comment(7)
Why do you need a Thrift server when these are temporary tables? Couldn't you just create your own HiveContext, which will connect to the locally created temporary metastore? Prehistoric
And BTW, why do you need to start it from your code? Prehistoric
If we start the Thrift server as a daemon, we cannot see the temp table. The daemon's session is different from the session in which we create the HiveContext, and a temp table is only available in the session that registered it. Sweptwing
Are you starting a metastore service? If not, I'm not surprised: when you run the Spark Thrift server, it creates its own metastore backend, and within your code you create another metastore backend, so the two metastores are independent. Prehistoric
Did you figure out how to do this? Roomette
@Roomette did you figure out how to do this? Pochard
Unfortunately not. I switched to Scala. You might be able to do it through py4j. Roomette
P
5

You can import it using the py4j Java gateway. The following code worked for Spark 2.0.2, and temp tables registered in the Python script could be queried through beeline.

from py4j.java_gateway import java_import
from pyspark.sql import SparkSession

# app_name and master are placeholders; define them before building the session.
spark = SparkSession \
        .builder \
        .appName(app_name) \
        .master(master) \
        .enableHiveSupport() \
        .config('spark.sql.hive.thriftServer.singleSession', True) \
        .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('INFO')

# java_import needs the gateway from sc, so call it once sc exists.
# The start call below uses the fully qualified class name and does not depend on it.
java_import(sc._gateway.jvm, "")

# Start the Thrift server in the JVM, passing the JVM-side context that
# corresponds to this PySpark session.
sc._gateway.jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark._jwrapped)

spark.sql('CREATE TABLE myTable')
data_file = "path to csv file with data"
dataframe = spark.read.option("header", "true").csv(data_file).cache()
# Register the DataFrame as a temp view so it is visible through the Thrift server.
dataframe.createOrReplaceTempView("myTempView")

Then use beeline to check that it started correctly:

in terminal> $SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
beeline> show tables;

It should show the tables and temp tables/views created in Python, including "myTable" and "myTempView" above. The Thrift server must share the same Spark session for temporary views to be visible.

(See the answer to: Avoid starting HiveThriftServer2 with created context programmatically.
NOTE: Hive tables can be accessed even if the Thrift server is started from a terminal, as long as it is connected to the same metastore. Temp views cannot, because they live in the Spark session and are not written to the metastore.)
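
Before opening beeline, it can help to confirm that something is listening on the Thrift port at all. The following is only a minimal sketch using the Python standard library; it assumes the server listens on localhost:10000, as in the beeline URL above, and the helper name thrift_port_is_open is made up for illustration. It checks TCP reachability only, not that the listener is actually a Thrift server.

```python
import socket


def thrift_port_is_open(host="localhost", port=10000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: the default host and port are assumptions taken
    from the beeline URL jdbc:hive2://localhost:10000.
    """
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling thrift_port_is_open() a few seconds after startWithContext returns gives a quick yes/no answer; a False result usually means the server has not finished starting or is bound to a different host or port.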

Pierian answered 29/12, 2016 at 22:3 Comment(0)
S
0

For Spark 3, the following works:

import sys

from py4j.java_gateway import java_import
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

java_import(sc._jvm, "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2")

# Forward this script's command-line arguments to the server's main method.
args = sys.argv[1:]
# main expects a Java String[], so build one through the py4j gateway.
java_args = sc._gateway.new_array(sc._gateway.jvm.java.lang.String, len(args))

for i, arg in enumerate(args):
    java_args[i] = arg
sc._jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(java_args)

Note that the main method of the HiveThriftServer2 class calls the startWithContext method. (See here for the source code)
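
The snippet above forwards sys.argv[1:] unchanged. If you would rather build the arguments in code, here is a rough sketch. It assumes that main accepts the same --hiveconf key=value options that are documented for sbin/start-thriftserver.sh (for example hive.server2.thrift.port); I have not verified that for every Spark 3 release. The helper name build_hiveconf_args and the port value are made up for illustration.

```python
def build_hiveconf_args(settings):
    """Turn {"key": value} into ["--hiveconf", "key=value", ...].

    Hypothetical helper; the --hiveconf option format is an assumption
    based on how start-thriftserver.sh is documented to be invoked.
    """
    args = []
    for key, value in settings.items():
        # Each setting becomes a flag followed by one key=value token.
        args.extend(["--hiveconf", f"{key}={value}"])
    return args


# Example values only; choose the port and host that fit your setup.
thrift_args = build_hiveconf_args({
    "hive.server2.thrift.port": 10001,
    "hive.server2.thrift.bind.host": "localhost",
})
```

The resulting list can take the place of sys.argv[1:] when filling the Java string array.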

Supervision answered 11/9, 2023 at 13:17 Comment(0)
