How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N [duplicate]
Asked Answered
N

13

30

The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0,N-1]. The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and near enough to N.

The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.

Neap answered 1/10, 2008 at 17:21 Comment(4)
Wait, if you already found another algorithm, what's the question?Throb
such a neat algorithm! had to share it with someone - and it seems to be recommended behavior according to the stackoverflow.com/faq: "It's also perfectly fine to ask and answer your own programming question, but pretend you're on JeopardyNeap
Answer to this looks the best to me. #2394746Counterfeit
@Neap I got a carte blanche to narrow the scope at meta.#334825 . Admittedly, I should have mentioned this in edit summary.Caseation
C
12

The random module from Python library makes it extremely easy and effective:

from random import sample
print sample(xrange(N), K)

sample function returns a list of K unique elements chosen from the given sequence.
xrange is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.

Cinemascope answered 1/10, 2008 at 17:47 Comment(5)
The python implementation is quite nice (see svn.python.org/view/python/trunk/Lib/random.py?view=markup, search for "sample"). They distinguish two cases, one for large K (K near N) and one for small K. For large K, they selectively copy elements over. For small K, they draw elements randomly, avoiding repetitions using a set.Neap
This is inefficient in memory for large sequences.Doorsill
hg.python.org/cpython/file/tip/Lib/random.py is the new source link.Caseation
Why not just random.shuffle?Greaves
Answer lacks an explanation - see Jonathans Hartley's comment.Perversity
S
14

In The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition, Knuth describes the following selection sampling algorithm:

Algorithm S (Selection sampling technique). To select n records at random from a set of N, where 0 < n ≤ N.

S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the number of records selected so far, and t is the total number of input records that we have dealt with.)

S2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.

S3. [Test.] If (N – t)U ≥ n – m, go to step S5.

S4. [Select.] Select the next record for the sample, and increase m and t by 1. If m < n, go to step S2; otherwise the sample is complete and the algorithm terminates.

S5. [Skip.] Skip the next record (do not include it in the sample), increase t by 1, and go back to step S2.

An implementation may be easier to follow than the description. Here is a Common Lisp implementation that select n random members from a list:

(defun sample-list (n list &optional (length (length list)) result)
  (cond ((= length 0) result)
        ((< (* length (random 1.0)) n)
         (sample-list (1- n) (cdr list) (1- length)
                      (cons (car list) result)))
        (t (sample-list n (cdr list) (1- length) result))))

And here is an implementation that does not use recursion, and which works with all kinds of sequences:

(defun sample (n sequence)
  (let ((length (length sequence))
        (result (subseq sequence 0 n)))
    (loop
       with m = 0
       for i from 0 and u = (random 1.0)
       do (when (< (* (- length i) u) 
                   (- n m))
            (setf (elt result m) (elt sequence i))
            (incf m))
       until (= m n))
    result))
Stans answered 9/10, 2008 at 16:2 Comment(1)
Thanks for the authoritative answer. I have the same requirement, and this is the algo I'm planning to implement. Thanks again.Bruns
C
12

The random module from Python library makes it extremely easy and effective:

from random import sample
print sample(xrange(N), K)

sample function returns a list of K unique elements chosen from the given sequence.
xrange is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.

Cinemascope answered 1/10, 2008 at 17:47 Comment(5)
The python implementation is quite nice (see svn.python.org/view/python/trunk/Lib/random.py?view=markup, search for "sample"). They distinguish two cases, one for large K (K near N) and one for small K. For large K, they selectively copy elements over. For small K, they draw elements randomly, avoiding repetitions using a set.Neap
This is inefficient in memory for large sequences.Doorsill
hg.python.org/cpython/file/tip/Lib/random.py is the new source link.Caseation
Why not just random.shuffle?Greaves
Answer lacks an explanation - see Jonathans Hartley's comment.Perversity
G
6

It is actually possible to do this in space proportional to the number of elements selected, rather than the size of the set you're selecting from, regardless of what proportion of the total set you're selecting. You do this by generating a random permutation, then selecting from it like this:

Pick a block cipher, such as TEA or XTEA. Use XOR folding to reduce the block size to the smallest power of two larger than the set you're selecting from. Use the random seed as the key to the cipher. To generate an element n in the permutation, encrypt n with the cipher. If the output number is not in your set, encrypt that. Repeat until the number is inside the set. On average you will have to do less than two encryptions per generated number. This has the added benefit that if your seed is cryptographically secure, so is your entire permutation.

I wrote about this in much more detail here.

Grenada answered 2/10, 2008 at 9:50 Comment(5)
Nice article. But, doesn't "XOR folding" destroy uniqueness? Sure, x != y implies encipher(x) != encipher(y) for decoding to work, but using e.g. (encipher(x) >> 4) ^ (encipher(x) & MASK) instead could "collapse" different x values to the same code -- so your "permutation" might contain repeats.Miceli
I don't have the theoretical basis to hand, but no, it doesn't destroy the 1-to-1 mapping properties of the block cipher. Xor folding is taken from the TEA cipher - perhaps check references on that for more detail.Grenada
@j_random_hacker: Of course, you're right. But it's nevertheless possible to come up with a pseudo random permutation using a custom Feistel cipher using the some cryptographic hash function as function F.Cirillo
see here: https://mcmap.net/q/127493/-unique-non-repeating-random-numbers-in-o-1/…Cirillo
For anyone reading this today, while this method sounds like it could be better, the sample method from random used with range is (in my experiments) actually faster than TEA even if you only use a single cycle. Also, I did occasionally get duplicates when using only v0 as the output. For that experiment, I created a TEA based number generator and initialized and computed 10.000 sets of 2048 numbers and had 6 cases where it generated a duplicate. Maybe multiple cycles would help but even for one cycle it is already slower than random.sample which also guarantees unique numbers.Xavier
N
3

The following code (in C, unknown origin) seems to solve the problem extremely well:

 /* generate N sorted, non-duplicate integers in [0, max] */
 int *generate(int n, int max) {
    int i, m, a;    
    int *g = (int *)calloc(n, sizeof(int));
    if (!g) return 0;

    m = 0;
    for (i = 0; i < max; i++) {
        a = random_in_between(0, max - i);
        if (a < n - m) {
            g[m] = i;
            m++;
        }
    }
    return g;
 }

Does anyone know where I can find more gems like this one?

Neap answered 1/10, 2008 at 17:27 Comment(7)
Programming Pearls by Jon Bentley (the pun on "gems" was intentional). :)Ethereal
What does "random_in_between" stands for?Cupric
This algorithm is terribly inefficient for small sample chosen from a large set. Picking 5 integers from a million takes one million calls to rand() instead of 5.Yonder
Thanks for the booktitle - I couldn't think of any other way to find it. Luis, random_in_between is for 'number between lo and hi, not including hi'. Praptak, perfectly true. Should have specified 'memory efficiency' versus 'time efficiency'. At least it's guaranteed to finish in bounded time...Neap
This is the Knuth's algorithm also described in another answer.Caseation
@RafałDowgird this could be improved by adding an early exit: for (i=0; i<max && m<n; i++) It adds an extra operation each iteration though, thus not so effective with n ≈ max.Caseation
Note this answer returns a sorted list, which was not a requirement of the initial question.Inceptive
H
2

Generate an array 0...N-1 filled a[i] = i.

Then shuffle the first K items.

Shuffling:

  • Start J = N-1
  • Pick a random number 0...J (say, R)
  • swap a[R] with a[J]
    • since R can be equal to J, the element may be swapped with itself
  • subtract 1 from J and repeat.

Finally, take K last elements.

This essentially picks a random element from the list, moves it out, then picks a random element from the remaining list, and so on.

Works in O(K) and O(N) time, requires O(N) storage.

The shuffling part is called Fisher-Yates shuffle or Knuth's shuffle, described in the 2nd volume of The Art of Computer Programming.

Hexahydrate answered 1/10, 2008 at 17:29 Comment(8)
Your approach is fine for generating permutations in [0, N[, but I want numbers in the range [0, K[. For instance, if N=2 and K=10, {5, 9} is a valid output sequence.Neap
Then generate 0 .. K, and then remove numbers randomly until you have N numbers.Throb
This is not uniformly random: since J is moved away from k[J] once, it has a different probability to be picked. E.g. with K=1, N-1 can never be picked.Caseation
@Caseation No. Note that in my example, R is first in the range 0...9, which means it's possible that R=9, and A[9] is swapped with itself.Hexahydrate
Okay, I've got it, but there was a missing 1 in your explanation.Caseation
I did 30M tests, and the differences were about ~0.001%.Caseation
You might be sirprised, but this is Knuth's shuffle, described in another answerCaseation
Well, I knew it had a name, but at the time I'd forgotten what it was.Hexahydrate
E
1

Speed up the trivial algorithm by storing the K numbers in a hashing store. Knowing K before you start takes away all the inefficiency of inserting into a hash map, and you still get the benefit of fast look-up.

Ethereal answered 1/10, 2008 at 17:25 Comment(4)
Yeah, that the way I did it when I needed 10M nonrepeating random numbers for a lotteryKorella
Not too memory-efficient - need a K-sized auxiliary structure. In time, you need K insertions and N removals. The algorithm I found needs only (at most) K random draws.Neap
You don't need an auxiliary structure at all. Just make the map your only structure. You'll always need K insertions to store K items. Why do you need N removals?Ethereal
Inserting into and checking the K-sized data structure isn't where problem with the trivial algo is, it's that as K -> N, your RNG will have a very high probability of generating a number you've already seen before when filling the end of the sequence. You need a hash map, but thats auxilliary.Lambskin
C
1

My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.

  • First, generate a linked list with K elements, going from 0 to K
  • Then as long as the list isn't empty, generate a random number between 0 and the size of the vector
  • Take that element, push it into another vector, and remove it from the original list

This solution only involves two loop iterations, and no hash table lookups or anything of the sort. So in actual code:

// Assume K is the highest number in the list
std::vector<int> sorted_list;
std::vector<int> random_list;

for(int i = 0; i < K; ++i) {
    sorted_list.push_back(i);
}

// Loop to K - 1 elements, as this will cause problems when trying to erase
// the first element
while(!sorted_list.size() > 1) {
    int rand_index = rand() % sorted_list.size();
    random_list.push_back(sorted_list.at(rand_index));
    sorted_list.erase(sorted_list.begin() + rand_index);
}                 

// Finally push back the last remaining element to the random list
// The if() statement here is just a sanity check, in case K == 0
if(!sorted_list.empty()) {
    random_list.push_back(sorted_list.at(0));
}
Chaffin answered 1/10, 2008 at 18:17 Comment(0)
S
1

Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.

Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm allows you to apply only n shuffles, where n is the number of elements to return. Generating the list will still take time proportional to the size of the list, but you can reuse your existing list for any future shuffling needs (assuming the size stays the same) with no need to preshuffle the partially shuffled list before restarting the shuffling algorithm.

The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...

This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.

Superorder answered 22/3, 2010 at 19:6 Comment(0)
C
0

The Reservoir Sampling version is pretty simple:

my $N = 20;
my $k;
my @r;

while(<>) {
  if(++$k <= $N) {
    push @r, $_;
  } elsif(rand(1) <= ($N/$k)) {
    $r[rand(@r)] = $_;
  }
}

print @r;

That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.

Capon answered 1/10, 2008 at 18:1 Comment(0)
O
0

If the list is sorted, for example, if you want to extract K elements out of N, but you do not care about their relative order, an efficient algorithm is proposed in the paper An Efficient Algorithm for Sequential Random Sampling (Jeffrey Scott Vitter, ACM Transactions on Mathematical Software, Vol. 13, No. 1, March 1987, Pages 56-67.).

edited to add the code in c++ using boost. I've just typed it and there might be many errors. The random numbers come from the boost library, with a stupid seed, so don't do anything serious with this.

/* Sampling according to [Vitter87].
 * 
 * Bibliography
 * [Vitter 87]
 *   Jeffrey Scott Vitter, 
 *   An Efficient Algorithm for Sequential Random Sampling
 *   ACM Transactions on MAthematical Software, 13 (1), 58 (1987).
 */

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>

#include <iomanip>

#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>

using namespace std;

// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;

    // Define a random number generator and initialize it with a reproducible
    // seed.
    // (The seed is unsigned, otherwise the wrong overload may be selected
    // when using mt19937 as the base_generator_type.)
    base_generator_type generator(0xBB84u);
    //TODO : change the seed above !
    // Defines the suitable uniform ditribution.
    boost::uniform_real<> uni_dist(0,1);
    boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);



void SequentialSamplesMethodA(int K, int N) 
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method A.
    {
    int top=N-K, S, curr=0, currsample=-1;
    double Nreal=N, quot=1., V;

    while (K>=2)
        {
        V=uni();
        S=0;
        quot=top/Nreal;
        while (quot > V)
            {
            S++; top--; Nreal--;
            quot *= top/Nreal;
            }
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        Nreal--; K--;curr++;
        }
    // special case K=1 to avoid overflow
    S=floor(round(Nreal)*uni());
    currsample+=1+S;
    cout << curr << " : " << currsample << "\n";
    }

void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method D. 
    {
    const int negalphainv=-13; //between -20 and -7 according to [Vitter87]
    //optimized for an implementation in 1987 !!!
    int curr=0, currsample=0;
    int threshold=-negalphainv*K;
    double Kreal=K, Kinv=1./Kreal, Nreal=N;
    double Vprime=exp(log(uni())*Kinv);
    int qu1=N+1-K; double qu1real=qu1;
    double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
    int S, limit;
    while ((K>1)&&(threshold<N))
        {
        Kmin1inv=1./(Kreal-1.);
        while(1)
            {//Step D2: generate X and U
            while(1)
                {
                X=Nreal*(1-Vprime);
                S=floor(X);
                if (S<qu1) {break;}
                Vprime=exp(log(uni())*Kinv);
                }
            U=uni();
            negSreal=-S;
            //step D3: Accept ?
            y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
            Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
            if (Vprime <=1.) {break;} //Accept ! Test [Vitter87](2.8) is true
            //step D4 Accept ?
            y2=0; top=Nreal-1.;
            if (K-1 > S)
                {bottom=Nreal-Kreal; limit=N-S;}
            else {bottom=Nreal+negSreal-1.; limit=qu1;}
            for(int t=N-1;t>=limit;t--)
                {y2*=top/bottom;top--; bottom--;}
            if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
                {//Accept !
                Vprime=exp(log(uni())*Kmin1inv);
                break;
                }
            Vprime=exp(log(uni())*Kmin1inv);
            }
        // Step D5: Select the (S+1)th record
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        curr++;
        N-=S+1; Nreal+=negSreal-1.;
        K-=1; Kreal-=1; Kinv=Kmin1inv;
        qu1-=S; qu1real+=negSreal;
        threshold+=negalphainv;
        }
    if (K>1) {SequentialSamplesMethodA(K, N);}
    else {
        S=floor(N*Vprime);
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        }
    }


int main(void)
    {
    int Ntest=10000000, Ktest=Ntest/100;
    SequentialSamplesMethodD(Ktest,Ntest);
    return 0;
    }

$ time ./sampling|tail

gives the following ouptut on my laptop

99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745

real    0m0.075s
user    0m0.060s
sys 0m0.000s
Olla answered 21/3, 2010 at 18:26 Comment(2)
According to https://mcmap.net/q/129749/-algorithm-to-select-a-single-random-combination-of-values , this generates combinations. not permutations.Caseation
What was asked was "a list of K non repeating integers" not a permutation. And I specified in my answer "if you are not interested by the order"Halfbaked
V
0

This Ruby code showcases the Reservoir Sampling, Algorithm R method. In each cycle, I select n=5 unique random integers from [0,N=10) range:

t=0
m=0
N=10
n=5
s=0
distrib=Array.new(N,0)
for i in 1..500000 do
 t=0
 m=0
 s=0
 while m<n do

  u=rand()
  if (N-t)*u>=n-m then
   t=t+1
  else 
   distrib[s]+=1
   m=m+1
   t=t+1
  end #if
  s=s+1
 end #while
 if (i % 100000)==0 then puts i.to_s + ". cycle..." end
end #for
puts "--------------"
puts distrib

output:

100000. cycle...
200000. cycle...
300000. cycle...
400000. cycle...
500000. cycle...
--------------
250272
249924
249628
249894
250193
250202
249647
249606
250600
250034

all integer between 0-9 were chosen with nearly the same probability.

It's essentially Knuth's algorithm applied to arbitrary sequences (indeed, that answer has a LISP version of this). The algorithm is O(N) in time and can be O(1) in memory if the sequence is streamed into it as shown in @MichaelCramer's answer.

Villalobos answered 4/9, 2012 at 15:9 Comment(1)
You should be measuring the probability of each complete permutation instead of individual numbers to actually show the method's quality - otherwise, you only show the randomness of number set selection, not of their order.Caseation
M
-1

Here's a way to do it in O(N) without extra storage. I'm pretty sure this is not a purely random distribution, but it's probably close enough for many uses.

/* generate N sorted, non-duplicate integers in [0, max[  in O(N))*/
 int *generate(int n, int max) {
    float step,a,v=0;
    int i;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    for (i=0; i<n; i++) {
        step = (max-v)/(float)(n-i);
        v+ = floating_pt_random_in_between(0.0, step*2.0);
        if ((int)v == g[i-1]){
          v=(int)v+1;             //avoid collisions
        }
        g[i]=v;
    }
    while (g[i]>max) {
      g[i]=max;                   //fix up overflow
      max=g[i--]-1;
    }
    return g;
 }
Millman answered 1/10, 2008 at 18:16 Comment(0)
B
-3

This is Perl Code. Grep is a filter, and as always I didn't test this code.

@list = grep ($_ % I) == 0, (0..N);
  • I = interval
  • N = Upper Bound

Only get numbers that match your interval via the modulus operator.

@list = grep ($_ % 3) == 0, (0..30);

will return 0, 3, 6, ... 30

This is pseudo Perl code. You may need to tweak it to get it to compile.

Bunin answered 1/10, 2008 at 17:21 Comment(1)
It doesn't seem to perform the task specified.Caseation

© 2022 - 2024 — McMap. All rights reserved.