pyarrow error: toPandas attempted Arrow optimization

When I enable PyArrow (spark.sql.execution.arrow.enabled) in my Spark session and then run toPandas(), it throws this error:

"toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this"

May I know why it happens?
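
For context, the setup looks roughly like this (a minimal sketch; the app name and DataFrame are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('arrow-test').getOrCreate()
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')

df = spark.range(1000)  # any Spark DataFrame
pdf = df.toPandas()     # raises the error quoted above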

Carthusian answered 28/8, 2018 at 6:16 Comment(2)
What are the datatypes in your dataframe? Remember, not all data types are supported yet. arrow.apache.org/blog/2017/07/26/spark-arrow - check the notes on usage.Valence
Also, the source code says this: spark.sql.execution.arrow.enabled=True is experimental.Autarchy

By default, the PyArrow optimization is disabled, but it seems in your case it has been enabled. You have to disable this configuration manually, either in the current Spark session or permanently in the Spark configuration file.
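
For the current session, that looks like this:

# Disable Arrow-backed conversion for the running session
spark.conf.set('spark.sql.execution.arrow.enabled', 'false')

On Spark 2.4.x you can instead keep Arrow enabled and let Spark fall back to the non-Arrow path automatically when the optimization fails (note this property gains a pyspark segment in its name in Spark 3.x):

spark.conf.set('spark.sql.execution.arrow.fallback.enabled', 'true')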

If you want to disable this for all of your Spark sessions, add the line below to your Spark configuration at SPARK_HOME/conf/spark-defaults.conf:

spark.sql.execution.arrow.enabled=false

But I would suggest using PyArrow if you are using pandas in your Spark application; it will speed up the data conversion between Spark and pandas.
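
To see the effect, you can time the same conversion both ways (an illustrative sketch: spark is your existing SparkSession, and the numbers will vary with your data and cluster):

import time

# A throwaway DataFrame, just large enough for the timing to be visible
df = spark.range(5000000).selectExpr('id', 'id * 2 AS doubled')

spark.conf.set('spark.sql.execution.arrow.enabled', 'false')
start = time.time()
df.toPandas()
print('without Arrow:', time.time() - start)

spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
start = time.time()
df.toPandas()
print('with Arrow:', time.time() - start)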

For more on PyArrow, please visit my blog.

Superannuation answered 22/8, 2019 at 5:34 Comment(0)

I was facing the same problem with Pyarrow.

My environment:

  • Python 3.6
  • Pyspark 2.4.4
  • Pyarrow 4.0.1
  • Jupyter notebook
  • Spark cluster on GCS

When I try to enable Pyarrow optimization like this:

spark.conf.set('spark.sql.execution.arrow.enabled', 'true')

I get the following warning:

createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however failed by the reason below: TypeError: 'JavaPackage' object is not callable

I solved this problem by:

  1. Printed the config of the Spark session:
import os
from pyspark import SparkConf

# Print every (key, value) pair of the current Spark configuration
spark_config = SparkConf().getAll()
for conf in spark_config:
    print(conf)

This will print the key-value pairs of the Spark configuration.

  2. Found the path to my jar files in this key-value pair:

('spark.yarn.jars', 'path\to\jar\files')
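
If you prefer to pull that entry out directly rather than scanning the printout (a small sketch; depending on the deploy mode the key may be spark.jars instead of spark.yarn.jars):

# Look up the jar directory in the session config
conf_dict = dict(SparkConf().getAll())
jars_path = conf_dict.get('spark.yarn.jars') or conf_dict.get('spark.jars')
print(jars_path)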

  3. After finding the path where my jar files are located, I printed the names of the jars for Pyarrow, like this:
jar_names = os.listdir('path\to\jar\files')
for jar_name in jar_names:
    # Keep only the Arrow-related jars
    if 'arrow' in jar_name:
        print(jar_name)

Found the following jars:

arrow-format-0.10.0.jar
arrow-memory-0.10.0.jar
arrow-vector-0.10.0.jar
  4. Then added the path of the Arrow jars to the Spark session config. For adding multiple jar file paths, use : as the delimiter.

spark.conf.set('spark.driver.extraClassPath', 'path\to\jar\files\arrow-format-0.10.0.jar:path\to\jar\files\arrow-memory-0.10.0.jar:path\to\jar\files\arrow-vector-0.10.0.jar')

  5. Then restarted the kernel, and the Pyarrow optimization worked.
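
Note that spark.driver.extraClassPath is a JVM classpath option, so it only takes effect when it is set before the driver JVM starts; that is why the kernel restart in step 5 matters. A sketch of setting everything up front when building the session (the jar paths are the same placeholders as above):

from pyspark.sql import SparkSession

# ':' is the classpath delimiter on Linux (';' on Windows)
arrow_jars = ':'.join([
    r'path\to\jar\files\arrow-format-0.10.0.jar',
    r'path\to\jar\files\arrow-memory-0.10.0.jar',
    r'path\to\jar\files\arrow-vector-0.10.0.jar',
])

spark = (SparkSession.builder
         .config('spark.driver.extraClassPath', arrow_jars)
         .config('spark.sql.execution.arrow.enabled', 'true')
         .getOrCreate())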
Cyrillus answered 2/7, 2021 at 11:38 Comment(1)
spark.conf.set('spark.sql.execution.arrow.enabled', 'true') may not work in cluster mode; passing --conf spark.sql.execution.arrow.enabled=true to spark-submit does work.Speiss