Algorithm to select a single, random combination of values?
Say I have y distinct values and I want to select x of them at random. What's an efficient algorithm for doing this? I could just call rand() x times, but the performance would be poor if x, y were large.

Note that combinations are needed here: each value should have the same probability of being selected, but their order in the result is not important. Sure, any algorithm generating a random permutation would qualify, but I wonder if it's possible to do this more efficiently without the random-order requirement.

How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N covers this case for permutations.

Illlooking answered 6/3, 2010 at 22:0 Comment(2)
@Jerry Coffin's answer does not guarantee randomness, it only guarantees that every element has the same probability of appearing in the output. So his answer does not solve your problem. Please check my answer in cr to learn more about this problem and get a working algorithm.Humbertohumble
@BrunoCosta the tag on the question is combinations, the term that means "in no particular order" (as the tag's description says).Eventual

Robert Floyd invented a sampling algorithm for just such situations. It's generally superior to shuffling then grabbing the first x elements since it doesn't require O(y) storage. As originally written it assumes values from 1..N, but it's trivial to produce 0..N and/or use non-contiguous values by simply treating the values it produces as subscripts into a vector/array/whatever.

In pseudocode, the algorithm runs like this (stealing from Jon Bentley's Programming Pearls column "A Sample of Brilliance"):

initialize set S to empty
for J := N-M + 1 to N do
    T := RandInt(1, J)
    if T is not in S then
        insert T in S
    else
        insert J in S

That last bit (inserting J if T is already in S) is the tricky part. The bottom line is that it assures the correct mathematical probability of inserting J so that it produces unbiased results.

It's O(x)¹ in time and O(1) with regard to y, with O(x) storage.

Note that, in accordance with the tag in the question, the algorithm only guarantees equal probability of each element occurring in the result, not of their relative order in it.


¹ O(x²) in the worst case for the hash map involved, which can be neglected since it's a virtually nonexistent pathological case where all the values have the same hash.
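For illustration, the pseudocode above translates to a minimal Python sketch (the name floyd_sample and the use of a built-in set are my choices, not Bentley's):

```python
import random

def floyd_sample(n, m):
    """Return a set of m distinct values from 1..n, each value
    equally likely to appear in the result (Floyd's algorithm)."""
    s = set()
    for j in range(n - m + 1, n + 1):
        t = random.randint(1, j)       # inclusive on both ends
        s.add(j if t in s else t)      # the tricky part: fall back to j
    return s
```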

Pagel answered 6/3, 2010 at 22:15 Comment(21)
Found it... Communications of the ACM, September 1987, Volume 30, Number 9.Concretize
@Federico: I guess I should have mentioned it, but it's also available in More Programming Pearls: Confessions of a Coder. I strongly recommend it.Pagel
This is really inefficient if Y is large. Imagine trying to select 100 unique integers from 2^32 values... this algorithm would be terrible. When x is of the same magnitude as y, it's the right method, though. (The method for x<<y is likely random sampling and just checking for duplicates. Though if you must stream items in order, you can use Poisson statistics to advance in larger steps than 1 at a time.)Intendancy
Why is it terrible for M=100, N=2^32? Other than that it's hard to get a uniform random integer in the range 1 .. 2^32 - 100, I mean. It looks fine to me: in the most extreme case of M=1, it just randomly selects a single number from 1 .. N and takes the corresponding element, which is optimal. In fact, if M and N are close together, I'd consider flipping the algorithm over: select N-M elements and then take as your result the set difference of the original set minus that. Reduces the number of calls to RandInt.Bucket
@SPWorley: I'm not sure what issue you think you see, but unless I mis-typed the code, the magnitude of Y makes no difference at all. It always makes exactly x calls to RandInt, regardless of Y.Pagel
Any particular set is used, instead of a list? I am only worried because, set might not maintain order. I think order is required.Florid
@Algorist: from the viewpoint of the algorithm, order isn't required. OTOH, each new number requires a lookup in the set, so you want it to be efficient -- a list would be a poor choice (which it almost always is...), but a tree or hash table would work nicely.Pagel
I spent a bit of time on the proof of correctness for practice. I posted it math.stackexchange.com/questions/178690/…Lipcombe
@Kache: Thanks -- that is a nice addition.Pagel
You forgot to say what makes it efficient. My belief is that it is efficient (compared to Fisher-Yates) because it does not require the allocation, initialisation, and [partial] reordering of an array the size of y. This is efficient when x is small, and the cost of the "is not in S" test is minimal.Dias
As far as I understand this, it won't work if the number of requested elements (m + 1) is near the range length (n), in which case you need to shuffle the elements.Humbertohumble
@BrunoCosta: This works even if the number of elements you want is equal to the range length (though in that case, shuffling might end up more efficient--but might about as easily not).Pagel
@JerryCoffin No matter how I see it, J = 1 if M+1 = n - 1, thus RandInt would generate a number between 1 and 1, which is 1; then it would generate a number between 1 and 2, and it would insert 2 because 1 is already there. So on and so forth... If I'm interpreting this wrongly I would be so glad if you took a bit of your time to explain it in a chat.Humbertohumble
@BrunoCosta: I suppose it depends on what you mean by "works". As implied by the fact that it produces a set for the results, it's more about what numbers are chosen than the order. If you ask it for N numbers from 1 through N, it'll do that (but yes, they'll be produced in order). The order of results will depend on how the Set you use orders its contents.Pagel
@JerryCoffin Set needs to be implemented in such way that it guarantees insertion order. Say that your algorithm entropy was high enough to be considered random for any situation. If Set was implemented with an unknown insertion order (like hash sets are), the entropy would be no longer guaranteed. The same applies to ordered insertion. You are not able to insert items randomly because that would turn this into a self recurrent problem. So this is about how this algorithm works when the M + 1 is near to N. It does not guarantee entropy and thus you won't be able to take elements at "random".Humbertohumble
@NikosM.: Only if you implement it (extremely) poorly. It's fundamentally an O(N) algorithm.Pagel
@JerryCoffin, it is if one assumes that hashmaps (or similar) have strict O(1) fetch-time, which is not the case (can have O(n) fetch-time in worst case). Plus one should take into account average (or worst-case) number of collisionsAggi
Is RandInt(1,J) inclusive?Eventual
@ivan_pozdeev: I suppose that depends on what RandInt function you mean.Pagel
Does this work if N and M are equal? Because in the first round of the for loop the first element is always selected (1 will always be the first element of the returned set S)Chum
Yes, it works for N = M. For this purpose, the order of the sample is irrelevant, and if N = M, the sample must include all numbers from 1 through N. This happens to generate them in order from 1 through N, but regardless of the order in which they're generated, the fundamental result when N = M must always be all numbers from 1 through N.Pagel

Assuming that you want the order to be random too (or don't mind it being random), I would just use a truncated Fisher-Yates shuffle. Start the shuffle algorithm, but stop once you have selected the first x values, instead of "randomly selecting" all y of them.

Fisher-Yates works as follows:

  • select an element at random, and swap it with the element at the end of the array.
  • Recurse (or more likely iterate) on the remainder of the array, excluding the last element.

Steps after the first do not modify the last element of the array. Steps after the first two don't affect the last two elements. Steps after the first x don't affect the last x elements. So at that point you can stop - the top of the array contains uniformly randomly selected data. The bottom of the array contains somewhat randomized elements, but the permutation you get of them is not uniformly distributed.

Of course this means you've trashed the input array. If that means you'd need to take a copy of it before starting, and x is small compared with y, then copying the whole array is not very efficient. Do note, though, that if all you're going to use it for in future is further selections, the fact that it's in somewhat-random order doesn't matter; you can just use it again. If you're doing the selection multiple times, therefore, you may be able to do only one copy at the start and amortise the cost.
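As a concrete sketch of the steps above (function name and in-place mutation are my choices):

```python
import random

def sample_truncated_shuffle(a, x):
    """Run Fisher-Yates from the back of a, but stop after x swaps.
    The last x slots of a then hold a uniform random x-subset."""
    n = len(a)
    for k in range(x):
        i = n - 1 - k
        j = random.randint(0, i)   # pick from the not-yet-fixed prefix
        a[i], a[j] = a[j], a[i]    # move the pick into the settled tail
    return a[n - x:]
```

Note that a is modified in place, matching the trashed-input-array caveat above.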

Bucket answered 7/3, 2010 at 1:29 Comment(4)
This assumes you have an input list that you can modify in place by swapping. This is often true, but just as often is not possible.Intendancy
Good point - I indirectly pointed this out already by saying the bottom part had been sort-of-randomized-a-bit, but I've made it explicit.Bucket
see my answer implementing a non-destructive partial fisher-yates-knuth suffle, in strictly O(n) time and spaceAggi
@SPWorley, see my answer which is partial shuffle non-destructiveAggi

If you really only need to generate combinations - where the order of elements does not matter - you may use combinadics as they are implemented e.g. here by James McCaffrey.

Contrast this with k-permutations, where the order of elements does matter.

In the first case (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1) are considered the same - in the latter, they are considered distinct, though they contain the same elements.

In case you need combinations, you may really only need to generate one random number (albeit it can be a bit large) - that can be used directly to find the m-th combination. Since this random number represents the index of a particular combination, it follows that your random number should be between 0 (inclusive) and C(n,k) (exclusive). Calculating combinadics might take some time as well.

It might just not be worth the trouble - besides, Jerry's and Federico's answers are certainly simpler than implementing combinadics. However, if you really only need a combination and you are bugged about generating exactly the number of random bits that are needed and none more... ;-)
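To make the unranking idea concrete, here is a hedged Python sketch (my own illustration, not Mr. McCaffrey's implementation): it maps an index in [0, C(n,k)) to the combination with that lexicographic rank.

```python
from math import comb

def kth_combination(idx, n, k):
    """Return the idx-th k-combination of range(n) in lexicographic
    order, for 0 <= idx < comb(n, k)."""
    result = []
    x = 0
    for r in range(k, 0, -1):
        # skip past all combinations whose next element is x
        while comb(n - x - 1, r - 1) <= idx:
            idx -= comb(n - x - 1, r - 1)
            x += 1
        result.append(x)
        x += 1
    return result
```

Drawing a single random integer below comb(n, k) and unranking it then yields a uniformly random combination.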

While it is not clear whether you want combinations or k-permutations, here is C# code for the latter (yes, we could generate only a complement if x > y/2, but then we would be left with a combination that must be shuffled to get a real k-permutation):

static class TakeHelper
{
    public static IEnumerable<T> TakeRandom<T>(
        this IEnumerable<T> source, Random rng, int count)
    {
        T[] items = source.ToArray();

        count = count < items.Length ? count : items.Length;

        for (int i = items.Length - 1 ; count-- > 0; i--)
        {
            int p = rng.Next(i + 1);
            yield return items[p];
            items[p] = items[i];
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        Random rnd = new Random(Environment.TickCount);
        int[] numbers = new int[] { 1, 2, 3, 4, 5, 6, 7 };
        foreach (int number in numbers.TakeRandom(rnd, 3))
        {
            Console.WriteLine(number);
        }
    }
}

Another, more elaborate implementation that generates k-permutations, that I had lying around and I believe is in a way an improvement over existing algorithms if you only need to iterate over the results. While it also needs to generate x random numbers, it only uses O(min(y/2, x)) memory in the process:

    /// <summary>
    /// Generates unique random numbers
    /// <remarks>
    /// Worst case memory usage is O(min((emax-imin)/2, num))
    /// </remarks>
    /// </summary>
    /// <param name="random">Random source</param>
    /// <param name="imin">Inclusive lower bound</param>
    /// <param name="emax">Exclusive upper bound</param>
    /// <param name="num">Number of integers to generate</param>
    /// <returns>Sequence of unique random numbers</returns>
    public static IEnumerable<int> UniqueRandoms(
        Random random, int imin, int emax, int num)
    {
        int dictsize = num;
        long half = (emax - (long)imin + 1) / 2;
        if (half < dictsize)
            dictsize = (int)half;
        Dictionary<int, int> trans = new Dictionary<int, int>(dictsize);
        for (int i = 0; i < num; i++)
        {
            int current = imin + i;
            int r = random.Next(current, emax);
            int right;
            if (!trans.TryGetValue(r, out right))
            {
                right = r;
            }
            int left;
            if (trans.TryGetValue(current, out left))
            {
                trans.Remove(current);
            }
            else
            {
                left = current;
            }
            if (r > current)
            {
                trans[r] = left;
            }
            yield return right;
        }
    }

The general idea is to do a Fisher-Yates shuffle and memorize the transpositions in the permutation. It was not published anywhere nor has it received any peer review whatsoever. I believe it is a curiosity rather than something of practical value. Nonetheless I am very open to criticism and would generally like to know if you find anything wrong with it - please consider this (and add a comment) before downvoting.
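For readers who prefer to study the idea outside C#, here is a hedged Python translation of the same transposition-dictionary scheme (names are mine, and it has not been verified against the C# above):

```python
import random

def unique_randoms(rng, imin, emax, num):
    """Yield num distinct integers in [imin, emax), lazily tracking
    Fisher-Yates transpositions in a dict instead of a full array."""
    trans = {}
    for i in range(num):
        current = imin + i
        r = rng.randrange(current, emax)
        right = trans.get(r, r)              # value currently at slot r
        left = trans.pop(current, current)   # value currently at slot current
        if r > current:
            trans[r] = left                  # virtual swap: park left at r
        yield right
```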

Kempe answered 6/3, 2010 at 22:21 Comment(0)

A little suggestion: if x >> y/2, it's probably better to select at random y - x elements, then choose the complementary set.
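A sketch of that suggestion in Python (using the standard library's random.sample for the smaller draw; the function name is mine):

```python
import random

def pick_by_complement(values, x):
    """Select x items from values by randomly choosing the
    len(values) - x items to *exclude* instead."""
    y = len(values)
    drop = set(random.sample(range(y), y - x))   # only y - x random draws
    return [v for i, v in enumerate(values) if i not in drop]
```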

Concretize answered 6/3, 2010 at 22:40 Comment(0)

The trick is to use a variation of shuffle or in other words a partial shuffle.

function random_pick( a, n ) 
{
  N = len(a);
  n = min(n, N);
  picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
  // partially shuffle the array, and generate unbiased selection simultaneously
  // this is a variation on fisher-yates-knuth shuffle
  for (i=0; i<n; i++) // O(n) times
  { 
    selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
    value = a[ selected ];
    a[ selected ] = a[ N ];
    a[ N ] = value;
    backup[ i ] = selected;
    picked[ i ] = value;
  }
  // restore partially shuffled input array from backup
  // optional step, if needed it can be ignored
  for (i=n-1; i>=0; i--) // O(n) times
  { 
    selected = backup[ i ];
    value = a[ N ];
    a[ N ] = a[ selected ];
    a[ selected ] = value;
    N++;
  }
  return picked;
}

NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffle), and is non-destructive on the input array (as a plain partial shuffle would not be); the restore step is optional.

adapted from here
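The pseudocode above translates to Python roughly as follows (a hedged sketch; the backup/restore bookkeeping mirrors the original):

```python
import random

def random_pick(a, n):
    """Pick n distinct elements from a via a partial Fisher-Yates
    shuffle, then undo the swaps so a is left unchanged."""
    N = len(a)
    n = min(n, N)
    picked, backup = [], []
    for _ in range(n):
        N -= 1
        sel = random.randint(0, N)     # unbiased: N, N-1, ... choices
        picked.append(a[sel])
        backup.append(sel)
        a[sel], a[N] = a[N], a[sel]    # move the pick out of range
    for sel in reversed(backup):       # undo the swaps in reverse order
        a[sel], a[N] = a[N], a[sel]
        N += 1
    return picked
```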

update

another approach, using only a single call to the PRNG (pseudo-random number generator) in [0,1], from Ivan Stojmenovic, "On Random and Adaptive Parallel Generation of Combinatorial Objects" (section 3), of O(N) (worst-case) complexity


Aggi answered 20/8, 2015 at 11:28 Comment(0)

Here is a simple way to do it which is only inefficient if Y is much larger than X.

void randomly_select_subset(
    int X, int Y,
    const int * inputs, int * outputs
) {
    int i, r;
    /* copy the first X values, then for each later value decide
       whether it replaces a random slot (reservoir sampling) */
    for( i = 0; i < X; ++i ) outputs[i] = inputs[i];
    for( i = X; i < Y; ++i ) {
        r = rand_inclusive( 0, i );  /* uniform over [0, i] */
        if( r < X ) outputs[r] = inputs[i];
    }
}

Basically, copy the first X of your distinct values to your output array, and then for each remaining value, randomly decide whether or not to include that value.

The random number is further used to choose an element of our (mutable) output array to replace.

Ise answered 12/8, 2016 at 2:33 Comment(0)

If, for example, you have 2^64 distinct values, you can use a symmetric key algorithm (with a 64 bits block) to quickly reshuffle all combinations. (for example Blowfish).

for(i=0; i<x; i++)
   e[i] = encrypt(key, i)

This is not random in the pure sense but can be useful for your purpose. If you want to work with an arbitrary number of distinct values following cryptographic techniques, you can, but it's more complex.
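As a toy illustration of the idea (SHA-256 standing in for Blowfish as the round function; feistel_permute and its parameters are illustrative, not a real cipher), a Feistel network gives a keyed bijection on 64-bit indices:

```python
import hashlib

def feistel_permute(key: bytes, i: int, rounds: int = 4) -> int:
    """Keyed bijection on [0, 2**64): a 4-round Feistel network
    whose round function is truncated SHA-256."""
    left, right = i >> 32, i & 0xFFFFFFFF
    for r in range(rounds):
        h = hashlib.sha256(key + bytes([r]) + right.to_bytes(4, "big")).digest()
        f = int.from_bytes(h[:4], "big")
        left, right = right, left ^ f     # classic Feistel step (invertible)
    return (left << 32) | right
```

Because it is a bijection, encrypting 0, 1, ..., x-1 yields x distinct values with no duplicate checking needed.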

Piselli answered 7/3, 2010 at 1:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.