What is a common use case for Apache Arrow in a data pipeline built in Spark?

What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into another format midway through my processing? Is it to pass that data in memory to another language like Python or Java without having to write it out to a text/JSON format?

Maudiemaudlin answered 11/5, 2021 at 20:28 Comment(0)

Disclaimer: This question is broad and I am somewhat involved with the Apache Arrow project, so my answer may or may not be biased.

This question is broad in the sense that a "When should I use NoSQL?" type of question is broad. It depends. This answer is based on the assumption that you already have a Spark pipeline. It is not an attempt at Spark vs. Arrow (which is even broader, to the point that I wouldn't touch it).

Many Apache Spark pipelines would never need to use Arrow. Spark, unlike Arrow-based pipelines, has its own in-memory dataframe format (https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html) which, to my knowledge, cannot be zero-copied to Arrow. So converting from one format to the other is likely to introduce a performance hit of some kind, and any benefit you gain has to be weighed against that.

You brought up one great example, which is switching to other languages / libraries. For example, Spark currently uses Arrow to apply a pandas UDF (https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html). In this case, whenever you are going to a library that doesn't use Spark's in-memory format (which means any non-Java library and some Java libraries), you are going to have to do a translation between in-memory formats anyway, so you are already paying the performance hit and you might as well switch to Arrow.
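To make that path concrete, here is a minimal sketch (not from the original answer; the toy data and the plus_one function are made up for illustration) of a scalar pandas UDF in PySpark. Spark ships column batches to the Python worker as Arrow record batches, pandas operates on them vectorized, and the results come back the same way instead of rows being pickled one at a time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.appName("arrow-pandas-udf-example").getOrCreate()

# Hypothetical toy data standing in for a real pipeline's DataFrame.
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 7.25)], ["id", "value"])

# A scalar pandas UDF (Spark 3.x type-hint style). Spark transfers the
# "value" column to the Python worker as Arrow batches, pandas adds 1.0
# to the whole Series at once, and the result returns to the JVM as
# Arrow batches.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1.0

df.select("id", plus_one("value").alias("value_plus_one")).show()
```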

There are some things that are faster with Arrow's format than with Spark's format. I'm not going to try to list them here because, for the most part, the benefit isn't going to outweigh the cost of going Spark -> Arrow in the first place, and I don't know that I have enough information to do so in any sort of comprehensive way. Instead, I'll provide one concrete example:

A common case for Arrow is when you need to transfer a table between processes that are on the same machine (or have a very fast I/O channel between them). In that case, the cost of serializing to Parquet and then deserializing back (Spark must do this to go Spark DataFrame -> Parquet -> wire -> Parquet -> Spark DataFrame) is more expensive than the I/O saved (Parquet is more compact than Spark's DataFrame representation, so you will save some in transmission). If you have a lot of this type of communication, it may be beneficial to leave Spark, do these transmissions in Arrow, and then return to Spark.
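As a rough sketch of that idea (again, not from the original answer; the table contents are invented), here is how two processes on the same machine might exchange a table using Arrow's IPC stream format via pyarrow. Writing and reading the stream is essentially a copy of columnar buffers, with no row-by-row encode/decode step of the kind Parquet or JSON would require:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical table; in practice this would come out of a Spark stage.
table = pa.table({"id": [1, 2, 3], "value": [2.0, 3.5, 7.25]})

# Producer side: serialize the table as an Arrow record-batch stream.
# Over a socket, pipe, or shared memory, the consumer can read these
# bytes directly into Arrow arrays without re-encoding.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Consumer side: reconstruct the table from the same bytes.
reader = ipc.open_stream(buf)
received = reader.read_all()
assert received.equals(table)
```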

Graves answered 11/5, 2021 at 21:23 Comment(3)
Thank you. We need to use a pandas UDF in pyspark because the data scientist does not want to rewrite the pipeline in pyspark, correct?Maudiemaudlin
That sounds right. pyspark is an interesting case. I'm a bit out of my depth here, but as I understand it, pyspark does not actually have the data marshaled into Python. Instead, the data lives in a JVM accessed via Py4J (or the data is remote). So dataframe manipulation functions in pyspark are instructions sent through Py4J to run on the JVM. Note: pyarrow actually works similarly, except the data is in C instead of Java.Graves
In theory there would be other ways to solve the problem. Koalas is another approach that mimics the pandas frontend with a pyspark backend.Graves
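For completeness, a small sketch of that Koalas-style approach using pyspark.pandas, the pandas-on-Spark API that Koalas was folded into as of Spark 3.2 (the column names and values below are made up for illustration):

```python
# Pandas-like frontend, Spark execution underneath.
import pyspark.pandas as ps

psdf = ps.DataFrame({"id": [1, 2, 3], "value": [2.0, 3.5, 7.25]})

# Familiar pandas-style filtering and aggregation, run by the Spark backend.
print(psdf[psdf["value"] > 3.0]["value"].mean())
```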
