Spark DataFrame to Arrow
I've been using Apache Arrow with Spark in Python for a while and have easily been able to convert between DataFrames and Arrow objects by using pandas as an intermediary.

Recently, however, I've moved from Python to Scala for interacting with Spark, and using Arrow isn't as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark DataFrame (or RDD, since they're easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow could read from Parquet. However, and please correct me if I'm wrong, after looking at the Arrow Java docs for a while I couldn't find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark DataFrame to an Arrow object? Perhaps converting the DataFrame's columns to arrays and then converting those to Arrow objects?

Any help would be much appreciated. Thank you

EDIT: Found the following link, which converts a Parquet schema to an Arrow schema. But it doesn't seem to return an Arrow object from a Parquet file like I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java

Asben answered 27/7, 2017 at 17:4 Comment(1)
Wes McKinney is one of the best people [IMHO] to answer this question. I tweeted him (twitter.com/gstaubli/status/895763929653157888) in the hopes of getting a response. Fingers crossed.Ginni
There is not a Parquet <-> Arrow converter available as a library in Java yet. You could have a look at the Arrow-based Parquet converter in Dremio (https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet) for inspiration. I am sure the Apache Parquet project would welcome your contribution implementing this functionality.

We have developed an Arrow reader/writer for Parquet in the C++ implementation: https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow. Nested data support is not complete yet, but it should be more complete within the next 6-12 months (sooner as contributors step up).

Widner answered 11/8, 2017 at 14:37 Comment(3)
Sorry for the side question but trying to understand how the performance benefits of Apache Arrow are obtained with the Java implementation. Looking at github.com/apache/arrow/tree/master/java/memory/src/main/java/… and github.com/apache/arrow/tree/master/cpp/src/arrow/python makes me think that arrow-cpp is strictly for Python and not to be used with Java/JVM. Is this correct, Wes?Vashtivashtia
is this in the works or in the queue of being worked on?Minutia
There has been some discussion on the mailing list. Note that open source projects don't really have "queues" -- if you really want something you often have to build it yourself, or wait for time to pass until someone else does.Widner
Now there is an answer: Arrow can be used to convert Spark DataFrames to pandas DataFrames, and it is used when calling Pandas UDFs. Please see the PySpark SQL "Pandas with Arrow" documentation page.

Conscript answered 30/5, 2020 at 22:31 Comment(0)
Spark 3.3 will have a mapInArrow API call, similar to the already existing mapInPandas call.

Here's the first PR that adds this to Python: https://github.com/apache/spark/pull/34505

A similar Spark Scala API call should also be available by the time 3.3 is released.

Not sure what exactly your use case is, but this may help.

PS: Note that this API is initially planned as developer-level, since working with Arrow directly may not be very user-friendly at first. It can be great if you're developing a library on top of Spark/Arrow, for example, where you can abstract away some of those Arrow nuances.

Firenew answered 28/11, 2021 at 5:11 Comment(0)
Apache Arrow is a cross-language development platform built around in-memory columnar data structures. Because it is cross-language, it can be used from many programming languages, such as Python, Java, C, C++, C#, Go, R, Ruby, JavaScript, MATLAB, and Rust.

Since it supports Java, it also supports Scala, as both run on the JVM. But to convert Scala objects to Arrow objects, you have to go through Python, because Arrow is written in Python and supports Python extensively.

Ultimately, Python talks to Scala and makes the JVM data readily available for use.

Please see the link below, where a detailed description is available: https://databricks.com/session/accelerating-tensorflow-with-apache-arrow-on-spark-bonus-making-it-available-in-scala

Doby answered 20/3, 2019 at 7:41 Comment(1)
Arrow does support Python extensively, but it's mainly written in C++.Firework
