Scala combinations function not terminating
Asked Answered
T

2

6

I need to generate the combinations for a list of 30,000 items using scalas combinations method on a stream / list

1 to 30000.toStream.combinations(2).size 

This function never completes. When I try the same operation in python

r = list(range(1,30000))
z = itertools.combinations(r, 2)
%time sum(1 for _ in z)

The operation completes in 26.2 seconds.

Whats going on here? How can I generate the combinations of a very large list in scala?

Theodore answered 15/11, 2016 at 15:43 Comment(0)
O
6

I don't know why the implementation in stdlib takes so long. However, this straightforward implementation (specialized to pairs and Lists), is comparable to the Python one:

def combinations2[A](l: List[A]): Iterator[(A, A)] =
  l.tails.flatMap(_ match {
    case h :: t => t.iterator.map((h, _))
    case Nil => Iterator.empty
  })

Then

scala> {
     |   val t0 = System.nanoTime
     |   val res = combinations2((1 to 30000).toList).size
     |   val secs = (System.nanoTime - t0) / 1000000.0
     |   s"$res (computed in $secs seconds)"
     | }
res11: String = 449985000 (computed in 24992.487638 seconds)
Outrageous answered 15/11, 2016 at 17:8 Comment(1)
Note that the combination method specifies that only one occurrence for each combination should be given. So List(1, 2, 2).combinations(2).toList.ordered == List(List(1, 2), List(2, 1)). I don't know whether OP is interested in this property or not, but it will certainly take some time longer to run distinct on your output (although probably still much less time than stdlib's implementation).Bulky
M
7

@TomasMikula provided an alternative, I was interested to see why combinations was inefficient in generating the result.

A quick look using Mission Control and Flight Recorder revealed the problem:

Mission Control

The CombinationItr iterator invokes IndexedSeqOptimized.slice each iteration of next(). ArrayBuilder creates a new builder each time it runs with number of elements it needs to iterate, which means it will allocate 30,000 Array[Int], each of them containing n - 1 elements, causing a total of 11.10GB in a 1 minute sample. This causes massive amounts of GC pressure and is generally not very effecient.

Mauriciomaurie answered 15/11, 2016 at 18:21 Comment(1)
This a great explanation. Thanks a ton.Theodore
O
6

I don't know why the implementation in stdlib takes so long. However, this straightforward implementation (specialized to pairs and Lists), is comparable to the Python one:

def combinations2[A](l: List[A]): Iterator[(A, A)] =
  l.tails.flatMap(_ match {
    case h :: t => t.iterator.map((h, _))
    case Nil => Iterator.empty
  })

Then

scala> {
     |   val t0 = System.nanoTime
     |   val res = combinations2((1 to 30000).toList).size
     |   val secs = (System.nanoTime - t0) / 1000000.0
     |   s"$res (computed in $secs seconds)"
     | }
res11: String = 449985000 (computed in 24992.487638 seconds)
Outrageous answered 15/11, 2016 at 17:8 Comment(1)
Note that the combination method specifies that only one occurrence for each combination should be given. So List(1, 2, 2).combinations(2).toList.ordered == List(List(1, 2), List(2, 1)). I don't know whether OP is interested in this property or not, but it will certainly take some time longer to run distinct on your output (although probably still much less time than stdlib's implementation).Bulky

© 2022 - 2024 — McMap. All rights reserved.