Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)?
Asked Answered
H

4

98

Why do Scala and frameworks like Spark and Scalding have both reduce and foldLeft? So then what's the difference between reduce and fold?

Hotpress answered 6/8, 2014 at 11:7 Comment(2)
See also: https://mcmap.net/q/121636/-scala-fold-vs-foldleft/…Garderobe
Also Difference between fold and reduce revistedSupernal
H
270

reduce vs foldLeft

A big big difference, not mentioned in any other stackoverflow answer relating to this topic clearly, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists. The collection can be chopped up and the reduce can operate on each chunk, then the reduce can operate on the results of each chunk - in fact the level of chunking need not stop one level deep. We could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.

If you just look at the signatures there is no reason for reduce to exist because you can achieve everything you can with reduce with a foldLeft. The functionality of foldLeft is a greater than the functionality of reduce.

But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid and so the cumulated value will be computed by a series of sequential aggregations.

foldLeft does not assume commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy because order is not important (so it doesn't matter which order to aggregate each of the results from each of the chunks). Strictly speaking commutativity is not necessary for parallelization, for example distributed sorting algorithms, it just makes the logic easier because you don't need to give your chunks an ordering.

If you have a look at the Spark documentation for reduce it specifically says "... commutative and associative binary operator"

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

Here is proof that reduce is NOT just a special case of foldLeft

scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par

scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds

scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds

reduce vs fold

Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets), Fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.

There is no fold method in Scalding because under the (strict) Map Reduce programming model we cannot define fold because chunks do not have an ordering and fold only requires associativity, not commutativity.

Put simply, reduce works without an order of cumulation, fold requires an order of cumulation and it is that order of cumulation that necessitates a zero value NOT the existence of the zero value that distinguishes them. Strictly speaking reduce should work on an empty collection, because its zero value can by deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation as there can exist a left and right zero value that are distinct (i.e. x op y != y op x). Of course Scala doesn't bother to work out what this zero value is as that would require doing some mathematics (which are probably uncomputable), so just throws an exception.

It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce has become a synonym for fold, rather than preserving it's original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.

So Spark does have a fold, but the order in which sub results (one for each partition) are combined (at the time of writing) is the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that fold uses runJob, which after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a treeReduce but no treeFold.

Conclusion

There is a difference between reduce and fold even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf) and one ought to assume operators are commutative in addition to being associative to give deterministic results. The latter is defined in terms of catomorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), thus do not require commutative operators.

In practice due to the unmathematical nature of programming, reduce and fold tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).

Extra: My Opinion On the Spark API

My opinion is that confusion would be avoided if use of the term fold was completely dropped in Spark. At least spark does have a note in their documentation:

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala.

Hotpress answered 6/8, 2014 at 11:7 Comment(17)
That is why foldLeft contains the Left in its name and why there is also a method called fold.Austen
@sschaef true, but fold doesn't exist in, say Scalding, because unlike reduce it doesn't require commutativity. I've updated my answer to explain this. Basically the point I'm trying to make is that the difference between fold* and reduce* is very much related to the roots of FP in Category Theory.Hotpress
Do you have an example for distinct (i.e. x op y != y op x) by using reduce and fold to see the difference. I dont think it is correct statement for reduce. For example, I use non-commutative operation "divide" to do the reduce. (List(1000000.0) ::: List.tabulate(100)(_ + 0.001)).reduce(_ / _) For the statement "(i.e. x op y != y op x)." it should give me random result, but it gives me the same result every time, so I think reduce still keep the order of cumulationIcarus
@Cloudtech That is a coincidence of it's single threaded implementation, not within it's specification. On my 4-core machine, if I try adding .par, so (List(1000000.0) ::: List.tabulate(100)(_ + 0.001)).par.reduce(_ / _) I get different results each time.Hotpress
Great answer. I think that "reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative" should read: "reduce should be given a commutative semigroup, i.e. a data structure which supports a combining operation that is both commutative and associative". I don't believe that reduce needs an identity element.Pacification
@AlexDean in the context of computer science, no it doesn't really need an identity as empty collections tend to just throw exceptions. But it's mathematically more elegant (and would be more elegant if collections did this) if the identity element is returned when the collection is empty. In mathematics "throw an exception" doesn't exist.Hotpress
@Hotpress - I see what you're saying, thanks. This is why you have to work with sumOption for Semigroups in Algebird...Pacification
@samthebest: Are your sure about the commutativity? github.com/apache/spark/blob/… says "For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection."Sheathe
To be fair contract of fold on RDD is exactly the same as contract of Scala ParSeq.fold. Although ParSeqs typically (always?) keep the order it is not really a part of the contract. One can easily implement fold (or not so easily foldByKey) which guarantee order of merging and the main reason to do it as it is done now is performance / flexibility of scheduling.Southernmost
@Sheathe I've updated my answer, indeed I assumed that the order of combining was done on the order of partitions (which would be nice, and the logical way to implement it), but it turns out it's based on a more of a "first come first serve" basis, and thus non-deterministic.Hotpress
@samthebest: So, in Spark, all of the combiner methods (reduce, reduceByKey, fold, foldByKey, aggregate, aggregateByKey, combineByKey) need to get functions passed that are both, associative and commutative afterall, right?Sheathe
@Sheathe That's correct, one could write their own reallyFold pimp though, as: rdd.mapPartitions(it => Iterator(it.fold(zero)(f)))).collect().fold(zero)(f), this wouldn't need f to commute.Hotpress
@samthebest: Not sure this works... The collect also gets the partitions first come, first serve. How do you know they are in the right order for the second fold? I tried this with a real example and I got my words back in the correct order with both your code and the original Spark fold. I suspect that within a partition fold in Spark will not need commutativity, but that putting the partitions together is the issue. Your solution does not improve on that, since the collect will have the same issue. What do you think?Sheathe
@Sheathe I read the code for collect, it does preserve order of partitions. Step through it and you'll see this (index, res) => results(index) = res as the result handler - so the result handler uses the partition index to place the result in the ArrayHotpress
@Hotpress so at this point I have a question afaik string concat is non-commutative, when I run this code multiple times List("abc","def","ghi","jk","lmnop","qrs","tuv","wx","yz").par.reduce(_+_) I guess it should give me random result but I'm getting same result everytime.Kicksorter
@Cloudtech @Hotpress I guess in this case of (List(1000000.0) ::: List.tabulate(100)(_ + 0.001)).par.reduce(_ / _) producing different results in each time related with division operation being not associative I mean not related with being non-commutative [docs.scala-lang.org/overviews/parallel-collections/… please look at line starting with "Note: Often, it is thought that.."Kicksorter
Truly hate to break the perfect 256 vote count, but this is an excellent answer!!Orten
F
10

If I am not mistaken, even though the Spark API does not require it, fold also requires for the f to be commutative. Because the order in which the partitions will be aggregated is not assured. For example in the following code only the first print out is sorted:

import org.apache.spark.{SparkConf, SparkContext}

object FoldExample extends App{

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Simple Application")
  implicit val sc = new SparkContext(conf)

  val range = ('a' to 'z').map(_.toString)
  val rdd = sc.parallelize(range)

  println(range.reduce(_ + _))
  println(rdd.reduce(_ + _))
  println(rdd.fold("")(_ + _))
}  

Print out:

abcdefghijklmnopqrstuvwxyz

abcghituvjklmwxyzqrsdefnop

defghinopjklmqrstuvabcwxyz

Festive answered 16/3, 2015 at 16:24 Comment(1)
After some back and forth, we believe you are correct. The order of combining is first come first serve. If you run sc.makeRDD(0 to 9, 2).mapPartitions(it => { java.lang.Thread.sleep(new java.util.Random().nextInt(1000)); it } ).map(_.toString).fold("")(_ + _) with 2+ cores several times, I think you will see it produces random (partition-wise) order. I've updated my answer accordingly.Hotpress
F
3

fold in Apache Spark is not the same as fold on not-distributed collections. In fact it requires commutative function to produce deterministic results:

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

This has been shown by Mishael Rosenthal and suggested by Make42 in his comment.

It's been suggested that observed behavior is related to HashPartitioner when in fact parallelize doesn't shuffle and doesn't use HashPartitioner.

import org.apache.spark.sql.SparkSession

/* Note: standalone (non-local) mode */
val master = "spark://...:7077"  

val spark = SparkSession.builder.master(master).getOrCreate()

/* Note: deterministic order */
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 4).sortBy(identity[String])
require(rdd.collect.sliding(2).forall { case Array(x, y) => x < y })

/* Note: all posible permutations */
require(Seq.fill(1000)(rdd.fold("")(_ + _)).toSet.size == 24)

Explained:

Structure of fold for RDD

def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  var jobResult: T
  val cleanOp: (T, T) => T
  val foldPartition = Iterator[T] => T
  val mergeResult: (Int, T) => Unit
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}

is the same as structure of reduce for RDD:

def reduce(f: (T, T) => T): T = withScope {
  val cleanF: (T, T) => T
  val reducePartition: Iterator[T] => Option[T]
  var jobResult: Option[T]
  val mergeResult =  (Int, Option[T]) => Unit
  sc.runJob(this, reducePartition, mergeResult)
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}

where runJob is performed with disregard of partition order and results in need of commutative function.

foldPartition and reducePartition are equivalent in terms of order of processing and effectively (by inheritance and delegation) implemented by reduceLeft and foldLeft on TraversableOnce.

Conclusion: fold on RDD cannot depend on order of chunks and needs commutativity and associativity.

Facility answered 6/8, 2014 at 11:7 Comment(5)
I have to admit that the etymology is confusing and programming literature is lacking in formal definitions. I think it's safe to say that fold on RDDs is indeed really just the same as reduce, but this doesn't respect the root mathematical differences (I've updated my answer to be even more clear). Though I disagree that we really need commutativity provided one is confident whatever their partioner is doing, it's preserving order.Hotpress
Undefined order of fold is not related to partitioning. It is a direct consequence of a runJob implementation.Ponderous
AH! Sorry I couldn't work out what your point was, but having read through the runJob code I see that indeed it does the combining according to when a task is finished, NOT the order of the partitions. It's this key detail that makes everything fall into place. I've edited my answer again and thus corrected the mistake you point out. Please could you either remove your bounty since we are now in agreement?Hotpress
I cannot edit or remove - there is no such option. I can award but I think you get quite a few points from a attention alone, am I wrong? If you confirm that you want me to reward I do it in the next 24 hours. Thanks for corrections and sorry for a method but it looked like you ignore all the warnings, it is a big thing, and answer has been quoted all over the place.Ponderous
How about you award it to @Mishael Rosenthal since he was the first to clearly state the concern. I have no interest in the points, I just like using SO for the SEO and organisation.Hotpress
T
2

One other difference for Scalding is the use of combiners in Hadoop.

Imagine your operation is commutative monoid, with reduce it will be applied on the map side also instead of shuffling/sorting all data to reducers. With foldLeft this is not the case.

pipe.groupBy('product) {
   _.reduce('price -> 'total){ (sum: Double, price: Double) => sum + price }
   // reduce is .mapReduceMap in disguise
}

pipe.groupBy('product) {
   _.foldLeft('price -> 'total)(0.0){ (sum: Double, price: Double) => sum + price }
}

It is always good practice to define your operations as monoid in Scalding.

Triumvir answered 7/8, 2014 at 23:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.