Spark Async interface for Fold, Reduce, Aggregate?
In the official Spark RDD API:

https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/rdd/AsyncRDDActions.html

count, collect, foreach, and take all have async variants that return a Future.

Why do fold, reduce, and aggregate not have this async/future interface? That seems pretty important.
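In the meantime, a common workaround is to wrap the blocking action in a plain Scala `Future` yourself. This is only a sketch under stated assumptions: it presumes an existing `SparkContext` named `sc`, and the helper name `foldAsync` is hypothetical, not part of the Spark API.

```scala
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.rdd.RDD

// Hypothetical helper (not in Spark's API): run a blocking fold on
// another thread so the caller isn't blocked. Note this wrapper
// blocks a thread from `ec` and, unlike Spark's FutureAction
// (returned by e.g. countAsync), it cannot cancel the Spark job.
def foldAsync[T](rdd: RDD[T], zero: T)(op: (T, T) => T)
                (implicit ec: ExecutionContext): Future[T] =
  Future { rdd.fold(zero)(op) }

// Usage sketch (assumes an existing SparkContext `sc`):
// implicit val ec: ExecutionContext = ExecutionContext.global
// val sumF: Future[Int] = foldAsync(sc.parallelize(1 to 100), 0)(_ + _)
```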

Smell answered 31/3, 2015 at 15:45 Comment(2)
And saveAsObjectFile lacks one too.Turnstile
Agreed. This is a disturbing inconsistency in the Spark API. If anything, it would make more sense to provide an asynchronous option for fold since it's more general and you could use it to create an asynchronous reduce or count.Tracheostomy

!!! Edited

@Jan Van den bosch is right (see comments below). This question is not about transformations at all. In case someone else was fooled, I've left my misguided answer below.

!!! Original Answer (incorrect)

TL;DR: The difference is between Spark "actions" and "transformations": https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-operations


Notice that all the operations you listed with an asynchronous option are Spark "actions", which means they start processing the data right away and attempt to return synchronously. That may take a long time if there's a lot of data, so it's nice to have an asynchronous option.

Meanwhile, the operations you listed without an asynchronous option are Spark "transformations", which are lazily evaluated: they instantly create a plan to do the work but won't actually process any data until you apply an "action" later to return results.
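The distinction above can be sketched as follows. This is illustrative only and assumes an existing `SparkContext` named `sc`:

```scala
// Sketch (assumes a live SparkContext `sc`):
val rdd    = sc.parallelize(1 to 1000000)
val mapped = rdd.map(_ * 2)      // transformation: returns instantly, no data processed yet
val total  = mapped.count()      // action: runs the job, blocks until it finishes
val totalF = mapped.countAsync() // async action variant: returns a FutureAction[Long]
```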

In the meantime, do you have specific code or a problem you're trying to solve with this?

Tracheostomy answered 27/12, 2017 at 6:11 Comment(4)
It's not about actions vs transformations, it's about getting the results synchronously (.collect()) vs asynchronously (.collectAsync()).Bovine
@Jan Van den bosch, actually, because they are transformations, they aren't running things at all (sync or async). You would have to do an action to get the result of the transformation anyway. Are you saying you can do rdd.fold(...).collect, but for some reason rdd.fold(...).collectAsync doesn't work? That would be very surprising to me, since the fold and the collect(Async) should be independent.Tracheostomy
No, I'm not saying that. This question isn't about transformations at all. It's about having the methods collect + collectAsync, and count + countAsync, but not a reduceAsync, or a foldAsync.Bovine
Sorry, you're right, @Jan Van den bosch. I rarely use Spark's fold and reduce methods, so I was actually thinking about when I use Scala's fold or reduce within a Spark map or grouping of some sort.Tracheostomy

© 2022 - 2024 — McMap. All rights reserved.