I have been using Apache Arrow with Spark in Python for a while, and converting between DataFrames and Arrow objects has been easy using pandas as an intermediary.
Recently, however, I've moved from Python to Scala for interacting with Spark, and using Arrow isn't as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark DataFrame (or RDD, since they're easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow could read from Parquet. However, and please correct me if I'm wrong, after looking at the Arrow Java docs for a while I couldn't find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark DataFrame into an Arrow object? Perhaps converting the DataFrame's columns to arrays and then converting those to Arrow objects?
Any help would be much appreciated. Thank you.
EDIT: I found the following link, which converts a Parquet schema to an Arrow schema, but it doesn't seem to return an Arrow object from a Parquet file like I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java