Why is collect in SparkR so slow?

Asked 19/9, 2016 at 15:23 Answered 19/9, 2016 at 18:19

I have a 500K-row Spark DataFrame that lives in a parquet file. I'm using Spark 2.0.0 and the SparkR package that ships with Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.

To build a dataset I can work on in R, I use the collect() method to bring the Spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it would take to read an equivalently sized CSV file using the data.table package.

Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.

I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr can't do date math inside joins and filters as easily as SparkR, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).

Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?

Weakly answered 19/9, 2016 at 15:23 Comment(0)

@Will

I don't know whether the following actually answers your question, but Spark evaluates lazily. Transformations in Spark (or SparkR) don't create any data; they only build a logical plan to follow.

When you run an action like collect, Spark has to execute that plan and fetch the data from the source (assuming you haven't cached or persisted it), so the time you measure includes reading the parquet file and running every pending transformation.

If your data is small enough to be handled by local R easily, then there is no need to go through SparkR. Another option is to cache the data if you use it frequently.
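To make the lazy-evaluation point concrete, here is a minimal SparkR sketch. It assumes a local session, a placeholder parquet path, and a hypothetical numeric column named value; none of these come from the question. Forcing the cache with an action first lets you time the Spark-to-R transfer on its own, separately from reading and filtering the data.

```r
library(SparkR)

# Assumption: local mode with 4 cores, as described in the question
sparkR.session(master = "local[4]")

# Placeholder path; reading only builds a plan, no data is loaded yet
df <- read.parquet("path/to/data.parquet")

# Hypothetical column `value`; the filter is a transformation, so still lazy
filtered <- filter(df, df$value > 0)

# cache() only marks the DataFrame for caching; nothing is computed here
cache(filtered)

# count() is an action: it runs the plan and populates the cache
n <- count(filtered)

# With the data cached, this mostly measures the transfer from Spark into R
timing <- system.time(local_df <- collect(filtered))
print(timing)
```

If the collect step is still slow after the count has filled the cache, the time is going into moving the rows to R rather than into reading or transforming them.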

Converse answered 19/9, 2016 at 17:57 Comment(1)

The 500K-row case is only one example, drawn from tables with 300M rows. Spark is required to make this work in my setup, but the slowness of moving data between Spark and R is a major bottleneck. – Weakly 20/9, 2016 at 21:1

In short: serialization/deserialization is very slow. See, for example, this post on my blog: http://dsnotes.com/articles/r-read-hdfs. However, it should be equally slow in both SparkR and sparklyr.
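If the serialization step is the bottleneck, one possible workaround is to skip collect() and go through files instead: have Spark write the result to disk and read it back with data.table. The sketch below is only an illustration under assumptions. It presumes Spark runs in local mode so that R can see the output directory, that sdf is an existing SparkDataFrame, and that a temporary directory is an acceptable target. Column types such as dates come back as text-parsed values and may need converting afterwards.

```r
library(SparkR)
library(data.table)

# Assumption: `sdf` is an existing SparkDataFrame in a local-mode session
out_dir <- file.path(tempdir(), "sdf_export")  # hypothetical output location

# Write a single CSV part file with a header row
write.df(
  repartition(sdf, numPartitions = 1L),
  path = out_dir,
  source = "csv",
  mode = "overwrite",
  header = "true"
)

# Spark writes part-*.csv files plus marker files; keep only the CSV parts
files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)

# Read the part files with data.table and stack them into one table
local_dt <- rbindlist(lapply(files, fread))
```

Whether this beats collect() depends on the data and the setup, so it is worth timing both with system.time before committing to either.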

Reichstag answered 19/9, 2016 at 18:19 Comment(3)

Thanks for the explanation and the link. It seems this is a known weakness in the current link between R and Spark, and one that is less of a problem (though still present) in Python. – Weakly 20/9, 2016 at 21:0

In addition, I have just verified that what takes SparkR 180 seconds takes sparklyr 9 seconds. So there's something odd going on here. – Weakly 21/9, 2016 at 12:29

Good to know. I'll have a look. Maybe there is something new in the protocol. – Reichstag 21/9, 2016 at 15:28
