Why is zipped faster than zip in Scala?
I have written some Scala code to perform an element-wise operation on a collection. I defined two methods that perform the same task: one uses zip and the other uses zipped.

def ES(arr: Array[Double], arr1: Array[Double]): Array[Double] = arr.zip(arr1).map(x => x._1 + x._2)

def ES1(arr: Array[Double], arr1: Array[Double]): Array[Double] = (arr, arr1).zipped.map((x, y) => x + y)

To compare these two methods in terms of speed, I wrote the following code:

def fun(arr: Array[Double], arr1: Array[Double], f: (Array[Double], Array[Double]) => Array[Double], itr: Int): Unit = {
  val t0 = System.nanoTime()
  for (i <- 1 to itr) {
    f(arr, arr1)
  }
  val t1 = System.nanoTime()
  println("Total Time Consumed: " + (t1 - t0).toDouble / 1000000000 + " Seconds")
}

I call the fun method and pass ES and ES1 as shown below:

fun(Array.fill(10000)(math.random), Array.fill(10000)(math.random), ES , 100000)
fun(Array.fill(10000)(math.random), Array.fill(10000)(math.random), ES1, 100000)

The results show that ES1, which uses zipped, is faster than ES, which uses zip. Based on these observations, I have two questions.

Why is zipped faster than zip?

Is there an even faster way to do element-wise operations on a collection in Scala?

Dysprosium answered 5/1, 2020 at 8:40 Comment(5)
Related question: #59126410Calibre
Because JIT decided to optimize more aggressively the second time it saw 'fun' being invoked. Or because GC decided to clean something up while ES was running. Or because your operating system decided that it had better things to do while your ES test was running. Could be anything, this microbenchmark is just not conclusive.Prevailing
What are the results on your machine? How much faster?Florid
For the same population size and configuration, zipped takes 32 seconds while zip takes 44 seconds.Dysprosium
Your results are meaningless. Use JMH if you must do micro-benchmarks.Paunch
To answer your second question:

Is there an even faster way to do element-wise operations on a collection in Scala?

The sad truth is that, despite their conciseness, improved productivity, and resilience to bugs, functional languages aren't necessarily the most performant. In the OP's example, using higher-order functions to define a projection to be executed against collections has overhead, and the tight loop amplifies this. As others have pointed out, additional storage allocation for intermediate and final results also adds overhead.

If performance is critical, then in cases like yours (though by no means universally) you can unwind concise functional code back into an imperative equivalent in order to regain more direct control over memory usage and to eliminate function-call overhead.

In your specific example, the zipped sum can be performed imperatively by pre-allocating a fixed, mutable array of the correct size (since zip stops when one of the collections runs out of elements) and then adding the elements at each index together (since accessing array elements by ordinal index is a very fast operation).

For example, adding a third function, ES3, to your test suite:

def ES3(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  val minSize = math.min(arr.length, arr1.length)
  val array = Array.ofDim[Double](minSize)
  for (i <- 0 until minSize) {
    array(i) = arr(i) + arr1(i)
  }
  array
}

On my i7 I get the following response times:

OP ES  Total Time Consumed: 23.3747857 Seconds
OP ES1 Total Time Consumed: 11.7506995 Seconds
--
ES3    Total Time Consumed:  1.0255231 Seconds

Even more heinous would be direct in-place mutation of the shorter of the two arrays. This would obviously corrupt the shorter array's contents, so it should only be implemented if the caller won't need the original arrays for further work:

def ES4(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  val minSize = math.min(arr.length, arr1.length)
  val array = if (arr.length < arr1.length) arr else arr1
  for (i <- 0 until minSize) {
    array(i) = arr(i) + arr1(i)
  }
  array
}

Total Time Consumed: 0.3542098 Seconds

But obviously, direct mutation of array elements passed as parameters isn't in the spirit of Scala - this code smells of side effects.

To be honest, if you require this level of performance optimization in tight loops, you're likely better off writing these kinds of algorithms in Rust, Go or C.

Bellyache answered 5/1, 2020 at 9:41 Comment(11)
I tested this, and this version of the code is much faster than mine. One thing to ask: we cannot parallelize a for loop, so what would the behavior of your code be on a cluster of 4 nodes?Dysprosium
There's nothing parallelized in my code above. Although this specific problem is parallelizable (since multiple threads could work on different sections of the arrays), there wouldn't be much point for such a simple operation on only 10k elements - the overhead of creating and synchronizing new threads would likely outweigh any benefit. To be honest, if you require this level of performance optimization, you're likely better off writing these kinds of algorithms in Rust, Go or C.Bellyache
It will be more Scala-like and faster to use Array.tabulate(minSize)(i => arr(i) + arr1(i)) to create your array.Gallinacean
@SarveshKumarSingh this one is much slower. It takes almost 9 seconds.Dysprosium
@SarveshKumarSingh although concise and elegant as code, because of the lambda parameter this will likely still require a stacked function call in the tight loop, causing the noted degradation. Performance would only be equivalent if the higher-order function were somehow unwrapped and inlined.Bellyache
Array.tabulate should be much faster than either zip or zipped here (and is in my benchmarks).Kaon
@Bellyache "Performance would only be equivalent if the higher order function is somehow unwrapped and inlined." This isn't really accurate. Even your for is desugared to a higher-order function call (foreach). The lambda will only be instantiated once in both cases.Kaon
@TravisBrown yes, but the lambda (closure) will be invoked 10000 times in the inner loop, which I believe will be a function call someAnonLambda.invoke(i), with i being stacked, and arr and arr1 being captured in the closure, likely resulting in an additional heap-wrapped object. In order to avoid that overhead and get performance equivalent to our imperative loops, the lambda would need to be 'unpacked' and inlined. Would need to look at the bytecode, and even the assembly, to confirm, of course.Bellyache
Ok, I think I see why we are missing each other. @SarveshKumarSingh seemingly indicated that tabulate would be faster than an imperative loop, which I and OP immediately challenged (and you've since proven to be ~3 times slower), whereas presumably @Sarvesh and yourself are referring to zip et al. as a frame of reference.Bellyache
@Bellyache That seemingly resulted in some confusion. I don't think the tabulate method can be faster than a for loop because of the function-call and lambda-context-capture overhead involved, but it's both a faster and more "Scala-like" approach than the given zip and zipped.Gallinacean
@Bellyache Ah, makes sense, thanks for clarifying. :)Kaon
None of the other answers mention the primary reason for the difference in speed, which is that the zipped version avoids 10,000 tuple allocations. As a couple of the other answers do note, the zip version involves an intermediate array, while the zipped version doesn't, but allocating an array for 10,000 elements isn't what makes the zip version so much worse—it's the 10,000 short-lived tuples that are being put into that array. These are represented by objects on the JVM, so you're doing a bunch of object allocations for things that you're immediately going to throw away.
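
To make that concrete, here is the zip version written out step by step, with illustrative names (a sketch of what the two-stage pipeline allocates, not the library internals):

val xs = Array.fill(10000)(math.random)
val ys = Array.fill(10000)(math.random)

// Step 1: zip materializes an intermediate Array[(Double, Double)]. Each
// element is a scala.Tuple2 object on the JVM heap, so this is 10,000
// short-lived allocations per call.
val pairs: Array[(Double, Double)] = xs.zip(ys)

// Step 2: map walks the tuple array and builds the final Array[Double];
// every tuple is read once and then immediately becomes garbage.
val sums: Array[Double] = pairs.map(x => x._1 + x._2)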

The rest of this answer just goes into a little more detail about how you can confirm this.

Better benchmarking

You really want to be using a framework like jmh to do any kind of benchmarking responsibly on the JVM, and even then the responsibly part is hard, although setting up jmh itself isn't too bad. If you have a project/plugins.sbt like this:

addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")

And a build.sbt like this (I'm using 2.11.8 since you mention that's what you're using):

scalaVersion := "2.11.8"

enablePlugins(JmhPlugin)

Then you can write your benchmark like this:

package zipped_bench

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
class ZippedBench {
  val arr1 = Array.fill(10000)(math.random)
  val arr2 = Array.fill(10000)(math.random)

  def ES(arr: Array[Double], arr1: Array[Double]): Array[Double] =
    arr.zip(arr1).map(x => x._1 + x._2)

  def ES1(arr: Array[Double], arr1: Array[Double]): Array[Double] =
    (arr, arr1).zipped.map((x, y) => x + y)

  @Benchmark def withZip: Array[Double] = ES(arr1, arr2)
  @Benchmark def withZipped: Array[Double] = ES1(arr1, arr2)
}

And run it with sbt "jmh:run -i 10 -wi 10 -f 2 -t 1 zipped_bench.ZippedBench":

Benchmark                Mode  Cnt     Score    Error  Units
ZippedBench.withZip     thrpt   20  4902.519 ± 41.733  ops/s
ZippedBench.withZipped  thrpt   20  8736.251 ± 36.730  ops/s

Which shows that the zipped version gets about 80% more throughput, which is probably more or less the same as your measurements.

Measuring allocations

You can also ask jmh to measure allocations with -prof gc:

Benchmark                                                 Mode  Cnt        Score       Error   Units
ZippedBench.withZip                                      thrpt    5     4894.197 ±   119.519   ops/s
ZippedBench.withZip:·gc.alloc.rate                       thrpt    5     4801.158 ±   117.157  MB/sec
ZippedBench.withZip:·gc.alloc.rate.norm                  thrpt    5  1080120.009 ±     0.001    B/op
ZippedBench.withZip:·gc.churn.PS_Eden_Space              thrpt    5     4808.028 ±    87.804  MB/sec
ZippedBench.withZip:·gc.churn.PS_Eden_Space.norm         thrpt    5  1081677.156 ± 12639.416    B/op
ZippedBench.withZip:·gc.churn.PS_Survivor_Space          thrpt    5        2.129 ±     0.794  MB/sec
ZippedBench.withZip:·gc.churn.PS_Survivor_Space.norm     thrpt    5      479.009 ±   179.575    B/op
ZippedBench.withZip:·gc.count                            thrpt    5      714.000              counts
ZippedBench.withZip:·gc.time                             thrpt    5      476.000                  ms
ZippedBench.withZipped                                   thrpt    5    11248.964 ±    43.728   ops/s
ZippedBench.withZipped:·gc.alloc.rate                    thrpt    5     3270.856 ±    12.729  MB/sec
ZippedBench.withZipped:·gc.alloc.rate.norm               thrpt    5   320152.004 ±     0.001    B/op
ZippedBench.withZipped:·gc.churn.PS_Eden_Space           thrpt    5     3277.158 ±    32.327  MB/sec
ZippedBench.withZipped:·gc.churn.PS_Eden_Space.norm      thrpt    5   320769.044 ±  3216.092    B/op
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space       thrpt    5        0.360 ±     0.166  MB/sec
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space.norm  thrpt    5       35.245 ±    16.365    B/op
ZippedBench.withZipped:·gc.count                         thrpt    5      863.000              counts
ZippedBench.withZipped:·gc.time                          thrpt    5      447.000                  ms

…where gc.alloc.rate.norm is probably the most interesting part, showing that the zip version is allocating over three times as much as zipped (1,080,120 bytes per operation versus 320,152).

Imperative implementations

If I knew that this method was going to be called in extremely performance-sensitive contexts, I'd probably implement it like this:

  def ES3(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
    val minSize = math.min(arr.length, arr1.length)
    val newArr = new Array[Double](minSize)
    var i = 0
    while (i < minSize) {
      newArr(i) = arr(i) + arr1(i)
      i += 1
    }
    newArr
  }

Note that unlike the optimized version in one of the other answers, this uses while instead of a for, since the for will still desugar into Scala collections operations (a sketch of that rewrite follows the results below). We can compare this implementation (withWhile), the other answer's optimized (but not in-place) implementation (withFor), and the two original implementations:

Benchmark                Mode  Cnt       Score      Error  Units
ZippedBench.withFor     thrpt   20  118426.044 ± 2173.310  ops/s
ZippedBench.withWhile   thrpt   20  119834.409 ±  527.589  ops/s
ZippedBench.withZip     thrpt   20    4886.624 ±   75.567  ops/s
ZippedBench.withZipped  thrpt   20    9961.668 ± 1104.937  ops/s

That's a really huge difference between the imperative and functional versions, and all of these method signatures are exactly identical and the implementations have the same semantics. It's not like the imperative implementations are using global state, etc. While the zip and zipped versions are more readable, I personally don't think there's any sense in which the imperative versions are against the "spirit of Scala", and I wouldn't hesitate to use them myself.
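
To make the desugaring point concrete: a for comprehension over a Range is rewritten by the compiler into a call to the higher-order method foreach, roughly as follows (a sketch with illustrative names, not actual compiler output):

def addWithFor(out: Array[Double], a: Array[Double], b: Array[Double], n: Int): Unit =
  for (i <- 0 until n) out(i) = a(i) + b(i)

// The compiler rewrites the for comprehension above into (approximately):
def addWithForeach(out: Array[Double], a: Array[Double], b: Array[Double], n: Int): Unit =
  (0 until n).foreach(i => out(i) = a(i) + b(i))

In this case the JIT handles the desugared version well, which is why withFor and withWhile come out nearly identical in the table above.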

With tabulate

Update: I added a tabulate implementation to the benchmark based on a comment in another answer:

def ES4(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  val minSize = math.min(arr.length, arr1.length)
  Array.tabulate(minSize)(i => arr(i) + arr1(i))
}

It's much faster than the zip versions, although still much slower than the imperative ones:

Benchmark                  Mode  Cnt      Score     Error  Units
ZippedBench.withTabulate  thrpt   20  32326.051 ± 535.677  ops/s
ZippedBench.withZip       thrpt   20   4902.027 ±  47.931  ops/s

This is what I'd expect, since there's nothing inherently expensive about calling a function, and because accessing array elements by index is very cheap.
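
For reference, wiring this into the JMH class shown earlier is a one-line addition (assuming ES4 is defined inside ZippedBench alongside the other methods):

@Benchmark def withTabulate: Array[Double] = ES4(arr1, arr2)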

Kaon answered 5/1, 2020 at 14:9 Comment(0)
Consider lazyZip

(as lazyZip bs) map { case (a, b) => a + b }

instead of zip

(as zip bs) map { case (a, b) => a + b }

Scala 2.13 added lazyZip in favour of .zipped

Together with .zip on views, this replaces .zipped (now deprecated). (scala/collection-strawman#223)
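
The view-based alternative mentioned in that quote looks like this in 2.13 (a sketch; for this use case the semantics match lazyZip):

val as = Seq(1.0, 2.0, 3.0)
val bs = Seq(4.0, 5.0, 6.0)

// .zip on a view is lazy: pairs are produced on demand during map,
// and no intermediate strict collection of tuples is built.
val sums = as.view.zip(bs).map { case (a, b) => a + b }.toSeq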

zipped (and hence lazyZip) is faster than zip because, as explained by Tim and Mike Allen, zip followed by map will result in two separate transformations due to strictness, whilst zipped followed by map will result in a single transformation executed in one go due to laziness.

zipped gives Tuple2Zipped, and analysing Tuple2Zipped.map,

class Tuple2Zipped[...](val colls: (It1, It2)) extends ... {
  private def coll1 = colls._1
  private def coll2 = colls._2

  def map[...](f: (El1, El2) => B)(...) = {
    val b = bf.newBuilder(coll1)
    ...
    val elems1 = coll1.iterator
    val elems2 = coll2.iterator

    while (elems1.hasNext && elems2.hasNext) {
      b += f(elems1.next(), elems2.next())
    }

    b.result()
  }

we see that the two collections coll1 and coll2 are iterated over, and on each iteration the function f passed to map is applied along the way

b += f(elems1.next(), elems2.next())

without having to allocate and transform intermediary structures.


Applying Travis' benchmarking method, here is a comparison between the new lazyZip and the deprecated zipped, where

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
class ZippedBench {
  import scala.collection.mutable._
  val as = ArraySeq.fill(10000)(math.random)
  val bs = ArraySeq.fill(10000)(math.random)

  def lazyZip(as: ArraySeq[Double], bs: ArraySeq[Double]): ArraySeq[Double] =
    as.lazyZip(bs).map{ case (a, b) => a + b }

  def zipped(as: ArraySeq[Double], bs: ArraySeq[Double]): ArraySeq[Double] =
    (as, bs).zipped.map { case (a, b) => a + b }

  def lazyZipJavaArray(as: Array[Double], bs: Array[Double]): Array[Double] =
    as.lazyZip(bs).map{ case (a, b) => a + b }

  @Benchmark def withZipped: ArraySeq[Double] = zipped(as, bs)
  @Benchmark def withLazyZip: ArraySeq[Double] = lazyZip(as, bs)
  @Benchmark def withLazyZipJavaArray: ArraySeq[Double] = lazyZipJavaArray(as.toArray, bs.toArray)
}

gives

[info] Benchmark                          Mode  Cnt      Score      Error  Units
[info] ZippedBench.withZipped            thrpt   20  20197.344 ± 1282.414  ops/s
[info] ZippedBench.withLazyZip           thrpt   20  25468.458 ± 2720.860  ops/s
[info] ZippedBench.withLazyZipJavaArray  thrpt   20   5215.621 ±  233.270  ops/s

lazyZip seems to perform a bit better than zipped on ArraySeq. Interestingly, note the significantly degraded performance when using lazyZip on Array.

Calibre answered 5/1, 2020 at 11:41 Comment(1)
lazyZip is available in Scala 2.13.1. Currently I am using Scala 2.11.8Dysprosium
You should always be cautious with performance measurements because of JIT compilation, but a likely reason is that zipped is lazy and extracts elements from the original Array values during the map call, whereas zip creates a new Array object and then calls map on the new object.
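
One cheap way to reduce (though not eliminate) that distortion with the harness from the question is to warm both implementations up before timing anything; a sketch reusing the question's ES, ES1 and fun:

val a = Array.fill(10000)(math.random)
val b = Array.fill(10000)(math.random)

// Warmup pass: exercise both implementations so the JIT compiles the hot
// paths before measurement starts (JMH's -wi iterations automate this).
for (_ <- 1 to 2000) { ES(a, b); ES1(a, b) }

fun(a, b, ES, 100000)
fun(a, b, ES1, 100000)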

Buckles answered 5/1, 2020 at 9:8 Comment(0)
