Is ScalaCheck's Gen.pick really random?
Asked Answered
C

2

6

I observed the following unexpected behaviour when using ScalaCheck's Gen.pic, which (for me) indicates that its picking is not quite random, even though its documentation says so:

/** A generator that picks a given number of elements from a list, randomly */

I ran the below three little programs in order (in a span of 2 days, at different times, as it might matter) after setting

implicit override val generatorDrivenConfig = PropertyCheckConfig(
  maxSize = 1000, 
  minSize = 1000, 
  minSuccessful = 1000)

to get a decent sample size.

Program #1

val set = Set(1,2,3,4,5,6,7,8,9,10,
      11,12,13,14,15,16,17,18,19,20,
      21,22,23,24,25,26,27,28,29,30,
      31,32,33,34,35,36,37,38,39,40,
      41,42,43,44,45,46,47,48,49,50)

// Thanks to @Jubobs for the solution
// See: https://mcmap.net/q/1777762/-how-can-i-generate-a-list-of-n-unique-elements-picked-from-a-set
val g = Gen.pick(3, set).map { _.toList }
forAll (g) { s => println(s) }

Out of the 3000 numbers generated at 2 different runs I got a surprisingly similar, and quite non-random distribution (numbers are rounded, only top 5 listed, as for all listing from here on):

  • Number: frequency in run #1, frequency in run #2
  • 15: 33%, 33%
  • 47: 22%, 22%
  • 4: 15%, 16%
  • 19: 10%, 10%
  • 30: 6%, 6%

(Disclaimer: I couldn't find how to create a table here other then this way)

Program 2

val list: List[Int] = List.range(1, 50)
val g = Gen.pick(3, list)
forAll (g) { s => println(s) }

In case of using a List, the numbers seem to get "stuck" at the end of the range (3x1000 numbers in case of both runs):

  • 49: 33%, 33%
  • 48: 22%, 22%
  • 47: 14%, 14%
  • 46: 10%, 10%
  • 45: 6%, 6%

Interestingly, the frequencies are pretty much the same as in the case of Program 1.

Remark: I repeated the runs for lists up to 10 times, and experienced the very same distribution with +/- 1% differences, just didn't want to list all the numbers here in this strange "table" format.

Program 3

Just to spice up things a bit, I ran a third little snippet, creating the Set (Program 1) from a List (Program 2):

val set: Set[Int] = List.range(1, 50).toSet
val g = Gen.pick(3, set).map { _.toList }
forAll (g) { s => println(s) }

Now the numbers are the same as for Program 2 (List wins!), although the frequencies (again, for 3*1000 numbers in 2 runs) got slightly different at the end:

  • 49: 33%, 33%
  • 48: 23%, 22%
  • 47: 16%, 15%
  • 46: 9%, 10%
  • 45: 7%, 6%

Question

Even though the sample size is not enough (as it is never enough) to tell true randomness, I can't help but question Gen.pick's claimed randomness (as far as using it out-of-the-box, I might need to set some seed for it to work "more" randomly), since numbers got "stuck", and frequencies are almost the same.

Upon looking at Gen.pick's source code, at line #672 a certain seed0 is used:

def pick[T](n: Int, l: Iterable[T]): Gen[Seq[T]] = {
    if (n > l.size || n < 0) throw new IllegalArgumentException(s"invalid choice: $n")
    else if (n == 0) Gen.const(Nil)
    else gen { (p, seed0) =>
    // ...

which I can't find defined anywhere else (in Gen.scala source code, or in scala.util.Random documentation), but I have a hunch it might have something to do with the observed behaviour. Is this expected behaviour of Gen.pick? If so, how can I get "more" random picking?

Chew answered 8/5, 2017 at 4:52 Comment(2)
Bugfoot, Not sure if you still care, but I think that @ashawley diagnosis is wrong and this is actually just a bug. See my answer for some detailsJecon
Your reply is accepted as answer now, thanks for going the extra mile.Chew
J
5

Although @ashawley answer is already accepted I don't think it is correct. I think that this is actually a bug and it was introduced by erik-stripe's commit on Sep 1, 2016 and the bug is actually in the line

      val i = (x & 0x7fffffff).toInt % n

and it was supposed to be

      val i = (x & 0x7fffffff).toInt % count

which is still not quite correct.

I also expect that your 33% for the last value is actually 100% and you didn't take into account the fact that you select 3 elements so all your statistics should be multiplied by 3. So for 3-element selection the last element is selected 100% of the time, the previous one - 66.6% and so on, which is even worse than you expected.

Here is again an excerpt from the code:

else gen { (p, seed0) =>
  val buf = ArrayBuffer.empty[T]
  val it = l.iterator
  var seed = seed0
  var count = 0
  while (it.hasNext) {
    val t = it.next
    count += 1
    if (count <= n) {
      buf += t
    } else {
      val (x, s) = seed.long
      val i = (x & 0x7fffffff).toInt % n
      if (i < n) buf(i) = t
      seed = s
    }
  }
  r(Some(buf), seed)
}

So what this code is supposed to do and what it actually does? The if (count <= n) branch fills the output buf with first n elements and after that always the else branch works. To make it more clear I changed the while moving if outside to the following code:

  for (i <- 0 until  n) {
    val t = it.next
    buf += t
  }
  while (it.hasNext) {
    val t = it.next
    val (x, s) = seed.long
    val i = (x & 0x7fffffff).toInt % n
    if (i < n) buf(i) = t
    seed = s
  }

So now it becomes obvious that the else branch should at the same time decide if the current element should be added to the output buf AND which element there it should replace. Obviously, current code always selects each element because if (i < n) is always true given that i is calculated as something % n. And this is why you see such a huge skew to the last elements.

Obviously the plan was to use a modified version of Fisher–Yates shuffle that selects only first n elements of the shuffle and to do it properly you need to select random numbers in the range [0, count) and this is probably why the code is written the way it is written i.e. preserving counter inside while loop.

Using % count is still not quite correct because such a simple way doesn't produce uniform distribution when count is not a power of 2. To be more fair something like

    val c0 = choose(0, count-1)
    val rt: R[Int] = c0.doApply(p, seed)        
    seed = rt.seed      
    val i = rt.retrieve.get // index to swap current element with. Should be fair random number in range [0, count-1], see Fisher–Yates shuffle
    if (i < n) buf(i) = t

or some other way to create i as a fair uniformly distributed random number in such a range should be used.

Update (Why just % count is wrong)

You can look at java.util.Random.nextInt(int) implementation or org.scalacheck.Choose.chLng for an example of how it should be done. It is more complicated than just % count and there is a good reason for it. To illustrate it consider following example. Let's assume that your source random generator generates uniformly random 3-bit values i.e in range just [0, 7] and you want to get ranadom number in range [0, 2] and you do it by simply doing

srcGenerator.nextInt() % 3

Now consider mapping of values in range [0, 7] to your range [0, 2]:

  • 0, 3, 6 will be mapped to 0 (i.e. 3 values are mapped)
  • 1, 4, 7 will be mapped to 1 (i.e. 3 values are mapped)
  • 2, 5 will be mapped to 2 (i.e. only 2 values are mapped)

So if you do just % 3 your distribution will be 0 - 3/8, 1 - 3/8, 2 - 2/8 which is obviously not uniform. This is why those implementations I referenced earlier use some kind of a loop and discard some values generated by the source generator. It is required to produce unifrom distribution.

Jecon answered 8/5, 2017 at 21:57 Comment(19)
Accepted as answer @Jecon as it is the comprehensive explanation with solution proposal. I didn't have enough reputation to explicitly call out the b-word :-)Chew
I linked you and your post in the relevant GitHub issue @ github.com/rickynils/scalacheck/issues/332, if you may allow @JeconChew
Doesn't this just restate the problem, and then re-implement a solution with the same problem?Jig
@ashawley, I'm not sure what you meant, but actually my last piece of code (suggestion) contained a bug with a wrong usage of choose method that I noticed only after your comments. I'm still not sure that my last piece of code is correct because I'm not confident user of ScalaCheck internal data structures, but if you generate i on each step as a fair random number in range [0, count - 1], I expect this solution to generate perfectly random distribution. If you still see some mistakes, please let me know.Jecon
The problem is that in order to get an even distribution, you need to know the length of the collection, but as my answer suggests, that's not possible without consuming the iterable.Jig
@ashawley, I don't think you read the code correctly. pick does fully consume iterable on each attempt to generate next random subset. This is what while (it.hasNext) { val t = it.next ... lines do. Actually it is hard to imagine how you can have a hard skew to the end of the iterable and don't consume it entirelyJecon
No, it turns out (though it isn't obvious immediately) that you don't need to know the length. This is known as Reservoir Sampling, see gregable.com/2007/10/reservoir-sampling.html or en.wikipedia.org/wiki/Reservoir_sampling.Misconduct
Yes, that's the solution I was trying to grasp in my head. Thanks for the reference.Jig
@ashawley, but this (algorihm R from the wiki) is exactly what I expect origial intention of that code was and what my last piece of code suggests.Jecon
Ok, then you should edit your answers and show that code. I don't see it.Jig
@ashawley, well, my last piece of code is what you should put into the else branch of the while loop instead of the line that contains the bug that I put into the first piece of code. I'm not sure what else you expect.Jecon
Looks more complicated than it needs to be. Did you read the reservoir sampling articles?Jig
@ashawley, the only reason it is complicated is that it is hard to generate random(1, i) (or rather random(0, i-1) beacuse Scala arrays are 0-based) in a "ScalaCheck way". This is what first 4 of 5 lines of the code do (my i is j from the wki, my count is i from the wiki and my n is k from the wiki)! Have you looked at what ScalaCheck gen actually requires?Jecon
I believe it should be using % count, which is what you say at first, but then dismiss.Jig
@ashawley, no % count is an improvement but not the right solution if you want to get uniform random distribution. See my update for an example why this is wrong.Jecon
I'm not sure why you think this. You either misunderstand what count is in the code or you don't know what modulo does.Jig
@ashawley, are you sure you understand examlpe in my update "Why just % count is wrong" in the answer? Algorithm R from the wiki requires random(1, i). Don't you at least find it suspicious that established libraries don't implement such a method with a single simple % but use more complicated code for it?Jecon
No, it's not suspicious. Simple is beautiful. It generates a value from [0, Int.MaxValue).Jig
@ashawley, simple is beautiful as long as it is correct! Are you looking at nextInt(int) that is complicated to generate [0, requestedMaximum-1] or just nextInt() which is simple and generates [0, Int.MaxValue]. It takes tricks to map [0, Int.MaxValue] to [0, requestedMaximum-1] to still make it uniformly randomJecon
J
1

I don't think it has anything to do with the seed. It has everything to do with Scalacheck's heurisitic.

There is a subtle bug. Consider what it's doing. It is forcing itself to choose values at the beginning, and than randomly overwriting them after that:

while (it.hasNext) {
  val t = it.next
  count += 1
  if (count <= n) {
    buf += t
  } else {
    val (x, s) = seed.long
    val i = (x & 0x7fffffff).toInt % n
    if (i < n) buf(i) = t
    seed = s
  }
  ...

It is randomly assigning those elements to the result in the else-block, that's why it is giving preference to the tail values.

So, pick is randomly selecting values from a set. However, it has sacrificed picking equally among the values and favors the end of the list because the code is trying to iterate through the list lazily.

To attempt to get an even distribution of picked elements, you would need to know the length of the collection, but as my answer suggests, that's not possible without consuming the iterable twice.

Maybe if you ran reverse on your list, or shuffle, you'd get a better distribution of picks with pick.

Because Scalacheck is a generic property testing library, I predict it can't do either of these things without sacrificing performance for arbitrary-sized collections.

Update

But as Alexey Romanov points out, this should implement the reservoir sampling algorithm which avoids knowing the length and can be run in O(n) time. There is simply a defect in the code. The fix is simply correcting the line for random number generation. It should get a random number from 1 to the kth (count) element visited in the list.

val i = (x & 0x7fffffff).toInt % n

Should be:

val i = (x & 0x7fffffff).toInt % count

I've submitted a PR to Scalacheck:

https://github.com/rickynils/scalacheck/pull/333

Jig answered 8/5, 2017 at 11:38 Comment(3)
You are absolutely right that it is because of "postponing" the picking until the end of the list. Still I can't call the end result random picking... I made an issue at ScalaCheck's GitHub (github.com/rickynils/scalacheck/issues/332), quoting and crediting you if you may allow.Chew
You brought up the b-word, you were on the right track, and credit is given, but @Jecon got the full explanation, so I had to accept his as the answer.Chew
Well, I never thought I deserved the answer. I think you were choosing prematurely.Jig

© 2022 - 2024 — McMap. All rights reserved.