Unable to use an existing Hive permanent UDF from Spark SQL

J

2

25

I have previously registered a UDF with hive. It is permanent not TEMPORARY. It works in beeline.

CREATE FUNCTION normaliseURL AS 'com.example.hive.udfs.NormaliseURL' USING JAR 'hdfs://udfs/hive-udfs.jar';

I have spark configured to use the hive metastore. The config is working as I can query hive tables. I can see the UDF;

In [9]: spark.sql('describe function normaliseURL').show(truncate=False)
+-------------------------------------------+
|function_desc                              |
+-------------------------------------------+
|Function: default.normaliseURL             |
|Class: com.example.hive.udfs.NormaliseURL  |
|Usage: N/A.                                |
+-------------------------------------------+

However I cannot use the UDF in a sql statement;

spark.sql('SELECT normaliseURL("value")')
AnalysisException: "Undefined function: 'default.normaliseURL'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7"

If I attempt to register the UDF with spark (bypassing the metastore) it fails to register it, suggesting that it does already exist.

In [12]: spark.sql("create function normaliseURL as 'com.example.hive.udfs.NormaliseURL'")
AnalysisException: "Function 'default.normaliseURL' already exists in database 'default';"

I'm using Spark 2.0, hive metastore 1.1.0. The UDF is scala, my spark driver code is python.

I'm stumped.

Am I correct in my assumption that Spark can utilise metastore-defined permanent UDFs?
Am I creating the function correctly in hive?

Jamnis answered 18/8, 2016 at 16:52 Comment(7)

In your SparkSession.builder did you defined enableHiveSupport()? – Strauss 25/8, 2016 at 17:0

Yeah I did. I can see, and query hive-defined tables from spark so I assume hive support is enabled appropriately. – Jamnis 26/8, 2016 at 15:55

Humm... Did you defined your UDF jar in spark-submit or spark-shell call? Like: ./bin/spark-shell --jars <path-to-your-hive-udf>.jar – Strauss 26/8, 2016 at 16:33

@RobCowie: are you sure the DataType of value column is same in both UDF Class(in Scala) and the query what it returns? – Epizoic 29/10, 2016 at 11:23

can you paste com.example.hive.udfs.NormaliseURL code ? – Earing 12/11, 2016 at 5:0

try spark.sql('SELECT normaliseURL("value") from values(1)') – Earing 19/11, 2016 at 10:20

Hi, i suppose that the jar your are using for the udf is not available to spark and you get that error, try to check this answer as it seems to me to be your problem: #43272946 – Receptive 19/7, 2017 at 10:17

L

2

Issue is Spark 2.0 is not able to execute the functions whose JARs are located on HDFS.

Spark SQL: Thriftserver unable to run a registered Hive UDTF

One workaround is to define the function as a temporary function in Spark job with jar path pointing to a local edge-node path. Then call the function in same Spark job.

CREATE TEMPORARY FUNCTION functionName as 'com.test.HiveUDF' USING JAR '/user/home/dir1/functions.jar'

Lombardi answered 28/8, 2017 at 18:13 Comment(0)

I

0

It will work on spark on yarn environment however as suggested you need to use spark-shell --jars <path-to-your-hive-udf>.jar not in hdfs but in local.

Impose answered 20/12, 2016 at 22:38 Comment(0)

Recommended topics

Hot tags