HDP 3.1.4 - Spark2 with Hive Warehouse Connector error using spark-submit and pyspark shell: KeeperErrorCode = ConnectionLoss
Environment:

  • HDP 3.1.4 - configured and tested
  • Hive Server 2 - tested and working
  • Hive Server 2 LLAP - tested and working
  • Spark configured as per the documentation to use the Hive Warehouse Connector (HWC) - the relevant properties are sketched after this list
  • Apache Zeppelin - spark2 interpreter configured to use HWC
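
For reference, the HWC-related Spark properties described in the HDP 3.1.4 documentation look roughly like the following (host names, ports and the staging directory are placeholders, not the actual cluster values):

spark.sql.hive.hiveserver2.jdbc.url              jdbc:hive2://hs2host.machine:10500/
spark.datasource.hive.warehouse.metastoreUri     thrift://metastorehost.machine:9083
spark.datasource.hive.warehouse.load.staging.dir /tmp/hwc_staging
spark.hadoop.hive.llap.daemon.service.hosts      @llap0
spark.hadoop.hive.zookeeper.quorum               zk1.machine:2181,zk2.machine:2181,zk3.machine:2181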

Trying to execute the following script:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

# Create spark session
spark = SparkSession.builder.appName("LLAP Test - CLI").enableHiveSupport().getOrCreate()

# Create HWC session
hive = HiveWarehouseSession.session(spark).userPassword('hive','hive').build()

# Execute a query that reads from Hive through HWC and show the results
hive.executeQuery("select * from wifi_table where partit='2019-12-02'").show(20)

Problem: when submitting the application with spark-submit, or running the above script in the pyspark shell (the same happens with any script that executes a query through the HiveWarehouseSession), the Spark job gets stuck and eventually throws the following exception: java.lang.RuntimeException: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss

The command to execute is the following:

$ /usr/hdp/current/spark2-client/bin/spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip spark_compare_test.py
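
For the pyspark shell case, the shell is started with the same connector jar and Python zip, and the script above is pasted in:

$ /usr/hdp/current/spark2-client/bin/pyspark --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip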

Here is the stacktrace:

[...]
20/01/03 12:39:55 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
20/01/03 12:39:56 INFO DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/01/03 12:39:56 INFO DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0)
20/01/03 12:39:56 INFO DAGScheduler: Parents of final stage: List()
20/01/03 12:39:56 INFO DAGScheduler: Missing parents: List()
20/01/03 12:39:56 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
20/01/03 12:39:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.5 KB, free 366.3 MB)
20/01/03 12:39:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.6 KB, free 366.3 MB)
20/01/03 12:39:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on EDGE01.machine:38050 (size: 3.6 KB, free: 366.3 MB)
20/01/03 12:39:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
20/01/03 12:39:56 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/01/03 12:39:56 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
20/01/03 12:39:56 WARN TaskSetManager: Stage 0 contains a task of very large size (465 KB). The maximum recommended task size is 100 KB.
20/01/03 12:39:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, DN02.machine, executor 2, partition 0, NODE_LOCAL, 476705 bytes)
20/01/03 12:39:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on DN02.machine:41521 (size: 3.6 KB, free: 366.3 MB)
20/01/03 12:42:08 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, DN02.machine, executor 2): java.lang.RuntimeException: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.createDataReader(HiveWarehouseDataReaderFactory.java:66)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.hadoop.hive.registry.impl.ZkRegistryBase.ensureInstancesCache(ZkRegistryBase.java:619)
    at org.apache.hadoop.hive.llap.registry.impl.LlapZookeeperRegistryImpl.getInstances(LlapZookeeperRegistryImpl.java:422)
    at org.apache.hadoop.hive.llap.registry.impl.LlapZookeeperRegistryImpl.getInstances(LlapZookeeperRegistryImpl.java:63)
    at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getInstances(LlapRegistryService.java:181)
    at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getInstances(LlapRegistryService.java:177)
    at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getServiceInstanceForHost(LlapBaseInputFormat.java:415)
    at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getServiceInstance(LlapBaseInputFormat.java:397)
    at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getRecordReader(LlapBaseInputFormat.java:160)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.getRecordReader(HiveWarehouseDataReader.java:72)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.<init>(HiveWarehouseDataReader.java:50)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.getDataReader(HiveWarehouseDataReaderFactory.java:72)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.createDataReader(HiveWarehouseDataReaderFactory.java:64)
    ... 18 more
Caused by: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at shadecurator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
    at shadecurator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
    at shadecurator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
    at shadecurator.org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:489)
    at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199)
    at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193)
    at shadecurator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
    at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190)
    at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175)
    at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
    at shadecurator.org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194)
    at shadecurator.org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61)
    at shadecurator.org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53)
    at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576)
    at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.rebuild(PathChildrenCache.java:326)
    at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.start(PathChildrenCache.java:303)
    at org.apache.hadoop.hive.registry.impl.ZkRegistryBase.ensureInstancesCache(ZkRegistryBase.java:597)
    ... 29 more
[...]

I have tried the following with no effect whatsoever:

  • Checked ZooKeeper health and connection limiting
  • Changed the ZooKeeper hosts
  • Increased the ZooKeeper timeout to 10s, 120s and 600s
  • Submitted the application from multiple nodes; the error persists
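
Since the stack trace shows the LLAP registry reading ZooKeeper settings from the Hadoop configuration, a quick diagnostic is to print what the driver actually resolves for them (a sketch; _jsc is a PySpark-internal accessor to the JavaSparkContext, not a stable API):

# Inspect the Hadoop configuration values the LLAP ZooKeeper registry reads
hconf = spark.sparkContext._jsc.hadoopConfiguration()
print("hive.zookeeper.quorum =", hconf.get("hive.zookeeper.quorum"))
print("hive.llap.daemon.service.hosts =", hconf.get("hive.llap.daemon.service.hosts"))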

There is another strange behavior: running the same script in the Zeppelin spark2 interpreter produces no error, and HWC works. I have compared the two environments and found no configuration mismatch in the main variables.
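
One way to compare the two environments from inside each session is to dump the Spark conf entries and diff the output between Zeppelin and the shell (a minimal sketch; getAll() returns only explicitly set entries, so defaults will not show up):

# Dump all explicitly set Spark conf entries for a side-by-side diff
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)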

At this point I'm stuck and don't know where to look for further troubleshooting. I can add more information as requested.

Eckenrode asked 3/1, 2020 at 14:18 (3 comments)

Hello, any updates on this issue? - Infantile
Hello, I have the exact same problem. HWC code runs in Zeppelin but not with spark-submit. Any updates? - Trinatrinal
Problem solved by setting the --conf spark.hadoop.hive.zookeeper.quorum parameter in the spark-submit script. - Trinatrinal
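
For reference, the fix from the last comment amounts to passing the quorum explicitly on the command line; the spark.hadoop. prefix injects the property into the job's Hadoop configuration (the ZooKeeper host names below are placeholders):

$ /usr/hdp/current/spark2-client/bin/spark-submit --master yarn \
    --conf spark.hadoop.hive.zookeeper.quorum=zk1.machine:2181,zk2.machine:2181,zk3.machine:2181 \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar \
    --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip \
    spark_compare_test.py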
