What is a common use case for Apache Arrow in a data pipeline built in Spark?

What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into another format midway through my processing? Is it to pass that data in memory to another language like Python or Java without having to write it out to a text/JSON format?

Maudiemaudlin answered 11/5, 2021 at 20:28 Comment(0)

Disclaimer: This question is broad and I am somewhat involved with the Apache Arrow project, so my answer may or may not be biased.

This question is broad in the sense that a "When should I use NoSQL?" type of question is broad. It depends. This answer is based on the assumption that you already have a Spark pipeline. It is not an attempt at Spark vs. Arrow (which is even broader, to the point that I wouldn't touch it).

Many Apache Spark pipelines would never need to use Arrow. Spark, unlike Arrow-based pipelines, has its own in-memory dataframe format (https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html) which, to my knowledge, cannot be zero-copied to Arrow. So converting from one format to the other is likely to introduce a performance hit of some kind, and any benefit you gain has to be weighed against that.

You brought up one great example, which is switching to other languages / libraries. For example, Spark currently uses Arrow to apply a pandas UDF (https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html). In this case, whenever you are going to a library that doesn't use Spark's in-memory format (which means any non-Java library and some Java libraries), you are going to have to do a translation between in-memory formats anyway, so you are already paying the performance hit and you might as well switch to Arrow.
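To make that path concrete, here is a minimal sketch (not from the original answer; the toy data and the plus_one function are made up for illustration) of a scalar pandas UDF in PySpark. Spark ships column batches to the Python worker as Arrow record batches, pandas operates on them vectorized, and the results come back the same way instead of rows being pickled one at a time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.appName("arrow-pandas-udf-example").getOrCreate()

# Hypothetical toy data standing in for a real pipeline's DataFrame.
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 7.25)], ["id", "value"])

# A scalar pandas UDF (Spark 3.x type-hint style). Spark transfers the
# "value" column to the Python worker as Arrow batches, pandas adds 1.0
# to the whole Series at once, and the result returns to the JVM as
# Arrow batches.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1.0

df.select("id", plus_one("value").alias("value_plus_one")).show()
```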

There are some things that are faster with Arrow's format than with Spark's format. I'm not going to try to list them here because, for the most part, the benefit isn't going to outweigh the cost of going Spark -> Arrow in the first place, and I don't know that I have enough information to do so in any sort of comprehensive way. Instead, I'll provide one concrete example:

A common case for Arrow is when you need to transfer a table between processes that are on the same machine (or have a very fast I/O channel between them). In that case, the cost of serializing to Parquet and then deserializing back (Spark must do this to go Spark DataFrame -> Parquet -> wire -> Parquet -> Spark DataFrame) is more expensive than the I/O saved (Parquet is more compact than Spark's DataFrame representation, so you will save some in transmission). If you have a lot of this type of communication, it may be beneficial to leave Spark, do these transmissions in Arrow, and then return to Spark.
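As a rough sketch of that idea (again, not from the original answer; the table contents are invented), here is how two processes on the same machine might exchange a table using Arrow's IPC stream format via pyarrow. Writing and reading the stream is essentially a copy of columnar buffers, with no row-by-row encode/decode step of the kind Parquet or JSON would require:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical table; in practice this would come out of a Spark stage.
table = pa.table({"id": [1, 2, 3], "value": [2.0, 3.5, 7.25]})

# Producer side: serialize the table as an Arrow record-batch stream.
# Over a socket, pipe, or shared memory, the consumer can read these
# bytes directly into Arrow arrays without re-encoding.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Consumer side: reconstruct the table from the same bytes.
reader = ipc.open_stream(buf)
received = reader.read_all()
assert received.equals(table)
```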

Graves answered 11/5, 2021 at 21:23 Comment(3)
Thank you. We need to use a pandas UDF in pyspark because the data scientist does not want to rewrite the pipeline in pyspark, correct?Maudiemaudlin
That sounds right. pyspark is an interesting case. I'm a bit out of my depth here, but as I understand it, pyspark does not actually have the data marshaled into Python. Instead, the data lives in a JVM accessed via Py4J (or the data is remote). So dataframe manipulation functions in pyspark are instructions sent through Py4J to run on the JVM. Note: pyarrow actually works similarly, except the data is in C instead of Java.Graves
In theory there would be other ways to solve the problem. Koalas is another approach that mimics the pandas frontend with a pyspark backend.Graves
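For completeness, a small sketch of that Koalas-style approach using pyspark.pandas, the pandas-on-Spark API that Koalas was folded into as of Spark 3.2 (the column names and values below are made up for illustration):

```python
# Pandas-like frontend, Spark execution underneath.
import pyspark.pandas as ps

psdf = ps.DataFrame({"id": [1, 2, 3], "value": [2.0, 3.5, 7.25]})

# Familiar pandas-style filtering and aggregation, run by the Spark backend.
print(psdf[psdf["value"] > 3.0]["value"].mean())
```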
