I was facing the same problem with PyArrow.
My environment:
- Python 3.6
- PySpark 2.4.4
- PyArrow 4.0.1
- Jupyter Notebook
- Spark cluster on GCS
When I tried to enable the PyArrow optimization like this:
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
I got the following warning:
createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however failed by the reason below: TypeError: 'JavaPackage' object is not callable
I solved the problem as follows:
- First, I printed the configuration of the Spark session:
import os
from pyspark import SparkConf

# Collect all key-value pairs of the current Spark configuration
spark_config = SparkConf().getAll()
for conf in spark_config:
    print(conf)
This prints the key-value pairs of the Spark configuration.
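If a SparkSession named spark is already running in the notebook, the same key-value pairs can also be read from the live session; this is a minimal sketch assuming such a session exists:
for key, value in spark.sparkContext.getConf().getAll():
    print(key, value)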
- In the printed output, I found the path to my jar files in this key-value pair:
('spark.yarn.jars', 'path/to/jar/files')
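Instead of scanning the printout by eye, the jar directory can also be pulled out programmatically; a small sketch that reuses spark_config from the snippet above:
# spark_config is the list of (key, value) tuples collected earlier
jar_dir = dict(spark_config).get('spark.yarn.jars')
print(jar_dir)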
- After finding the directory where my jar files are located, I printed the names of the Arrow-related jars, like this:
jar_names = os.listdir('path/to/jar/files')
for jar_name in jar_names:
    # Keep only the Arrow-related jars
    if 'arrow' in jar_name:
        print(jar_name)
Found the following jars:
arrow-format-0.10.0.jar
arrow-memory-0.10.0.jar
arrow-vector-0.10.0.jar
- Then I added the paths of the Arrow jars to the Spark session config. To add multiple jar file paths, use : as the delimiter:
spark.conf.set('spark.driver.extraClassPath', 'path/to/jar/files/arrow-format-0.10.0.jar:path/to/jar/files/arrow-memory-0.10.0.jar:path/to/jar/files/arrow-vector-0.10.0.jar')
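Note that spark.driver.extraClassPath is normally read when the driver JVM starts, so setting it on a live session only takes effect after a restart (which is why the kernel restart in the next step is needed). If you construct the session yourself, the property can be set at build time instead; a minimal sketch, using the hypothetical path/to/jar/files placeholder from above:
import os
from pyspark.sql import SparkSession

# Hypothetical placeholder; substitute the directory found under spark.yarn.jars
jar_dir = 'path/to/jar/files'
arrow_jars = [os.path.join(jar_dir, name)
              for name in os.listdir(jar_dir) if 'arrow' in name]

spark = (SparkSession.builder
         .appName('arrow-classpath-example')
         # Colon-delimited jar list, set before the driver JVM starts
         .config('spark.driver.extraClassPath', ':'.join(arrow_jars))
         .getOrCreate())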
- Then I restarted the kernel, and the PyArrow optimization worked.
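To confirm the fix after the restart, you can re-enable Arrow and convert a small pandas DataFrame; this is a sketch assuming pandas is installed and spark is the active session:
import pandas as pd

spark.conf.set('spark.sql.execution.arrow.enabled', 'true')

# If Arrow is picked up, this conversion runs without the JavaPackage warning
df = spark.createDataFrame(pd.DataFrame({'x': [1, 2, 3]}))
df.show()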
Note: spark.sql.execution.arrow.enabled=True is experimental. – Autarchy