Spark Async interface for Fold, Reduce, Aggregate?
In the official Spark RDD API:

https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/rdd/AsyncRDDActions.html

count, collect, foreach, and take all have async variants that return a Future.

Why do fold, reduce, and aggregate not have this async/future interface? That seems pretty important.
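In the meantime, a common workaround is to wrap the blocking action in a plain Scala `Future` yourself. This is only a sketch under stated assumptions: it presumes an existing `SparkContext` named `sc`, and the helper name `foldAsync` is hypothetical, not part of the Spark API.

```scala
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.rdd.RDD

// Hypothetical helper (not in Spark's API): run a blocking fold on
// another thread so the caller isn't blocked. Note this wrapper
// blocks a thread from `ec` and, unlike Spark's FutureAction
// (returned by e.g. countAsync), it cannot cancel the Spark job.
def foldAsync[T](rdd: RDD[T], zero: T)(op: (T, T) => T)
                (implicit ec: ExecutionContext): Future[T] =
  Future { rdd.fold(zero)(op) }

// Usage sketch (assumes an existing SparkContext `sc`):
// implicit val ec: ExecutionContext = ExecutionContext.global
// val sumF: Future[Int] = foldAsync(sc.parallelize(1 to 100), 0)(_ + _)
```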

Smell answered 31/3, 2015 at 15:45 Comment(2)
And saveAsObjectFile lacks one too.Turnstile
Agreed. This is a disturbing inconsistency in the Spark API. If anything, it would make more sense to provide an asynchronous option for fold since it's more general and you could use it to create an asynchronous reduce or count.Tracheostomy

!!! Edited

@Jan Van den bosch is right (see comments below). This question is not about transformations at all. In case someone else was fooled, I've left my misguided answer below.

!!! Original Answer (incorrect)

TL;DR: The difference is between Spark "actions" and "transformations": https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-operations


Notice that all the operations you listed with an asynchronous option are Spark "actions", which means they start processing the data right away and attempt to return synchronously. That may take a long time if there's a lot of data, so it's nice to have an asynchronous option.

Meanwhile, the operations you listed without an asynchronous option are Spark "transformations", which are lazily evaluated: they instantly create a plan to do the work but won't actually process any data until you apply an "action" later to return results.
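The distinction above can be sketched as follows. This is illustrative only and assumes an existing `SparkContext` named `sc`:

```scala
// Sketch (assumes a live SparkContext `sc`):
val rdd    = sc.parallelize(1 to 1000000)
val mapped = rdd.map(_ * 2)      // transformation: returns instantly, no data processed yet
val total  = mapped.count()      // action: runs the job, blocks until it finishes
val totalF = mapped.countAsync() // async action variant: returns a FutureAction[Long]
```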

In the meantime, do you have specific code or a problem you're trying to solve with this?

Tracheostomy answered 27/12, 2017 at 6:11 Comment(4)
It's not about actions vs transformations, it's about getting the results synchronously (.collect()) vs asynchronously (.collectAsync()).Bovine
@Jan Van den bosch, actually, because they are transformations, they aren't running things at all (sync or async). You would have to do an action to get the result of the transformation anyway. Are you saying you can do rdd.fold(...).collect, but for some reason rdd.fold(...).collectAsync doesn't work? That would be very surprising to me, since the fold and the collect(Async) should be independent.Tracheostomy
No, I'm not saying that. This question isn't about transformations at all. It's about having the methods collect + collectAsync, and count + countAsync, but not a reduceAsync, or a foldAsync.Bovine
Sorry, you're right, @Jan Van den bosch. I rarely use Spark's fold and reduce methods, so I was actually thinking about when I use Scala's fold or reduce within a Spark map or grouping of some sort.Tracheostomy

© 2022 - 2024 — McMap. All rights reserved.