Generating a random, non-repeating sequence of all integers in .NET
Asked Answered
S

8

33

Is there a way in .NET to generate a sequence of all the 32-bit integers (Int32) in random order, without repetitions, and in a memory-efficient manner? Memory-efficient would mean using a maximum of just a few hundred mega bytes of main memory.

Ideally the sequence should be something like an IEnumerable<int>, and it lazily returns the next number in sequence, only when requested.

I did some quick research and I found some partial solutions to this:

  • Using a maximal linear feedback shift register - if I understood correctly, it only generates the numbers in increasing sequence and it does not cover the entire range
  • Using the Fisher-Yates or other shuffling algorithms over collections - this would violate the memory restrictions given the large range
  • Maintaining a set-like collection and keep generating a random integer (perhaps using Random) until it doesn't repeat, i.e. it's not in the set - apart from possibly failing to satisfy the memory requirements, it would get ridiculously slow when generating the last numbers in the sequence.
  • Random permutations over 32 bits, however I can't think of a way to ensure non-repeatability.

Is there another way to look at this problem - perhaps taking advantage of the fixed range of values - that would give a solution satisfying the memory requirements? Maybe the .NET class libraries come with something useful?

UPDATE 1

Thanks everyone for your insights and creative suggestions for a solution. I'll try to implement and test soon (both for correctness and memory efficiency) the 2 or 3 most promising solutions proposed here, post the results and then pick a "winner".

UPDATE 2

I tried implementing hvd's suggestion in the comment below. I tried using both the BitArray in .NET and my custom implementation, since the .NET one is limited to int.MaxValue entries, so not enough to cover the entire range of integers.

I liked the simplicity of the idea and i was willing to "sacrifice" those 512 MB of memory if it worked fine. Unfortunately, the run time is quite slow, spending up to tens of seconds to generate the next random number on my machine, which has a 3.5 GHz Core i7 CPU. So unfortunately this is not acceptable if you request many, many random numbers to be generated. I guess it's predictable though, it's a O(M x N) algorithm if i'm not mistaken, where N is 2^32 and M is the number of requested integers, so all those iterations take their toll.

Ideally i'd like to generate the next random number in O(1) time, while still meeting the memory requirements, maybe the next algorithms suggested here are suited for this. I'll give them a try as soon as I can.

UPDATE 3

I just tested the Linear Congruential Generator and i can say i'm quite pleased with the results. It looks like a strong contender for the winner position in this thread.

Correctness: all integers generated exactly once (i used a bit vector to check this).

Randomness: fairly good.

Memory usage: Excellent, just a few bytes.

Run time: Generates the next random integer very fast, as you can expect from an O(1) algorithm. Generating every integer took a total of approx. 11 seconds on my machine.

All in all i'd say it's a very appropriate technique if you're not looking for highly randomized sequences.

UPDATE 4

The modular multiplicative inverse technique described below behaves quite similarly to the LCG technique - not surprising, since both are based on modular arithmetic -, although i found it a bit less straightforward to implement in order to yield satisfactorily random sequences.

One interesting difference I found is that this technique seems faster than LCG: generating the entire sequence took about 8 seconds, versus 11 seconds for LCG. Other than this, all other remarks about memory efficiency, correctness and randomness are the same.

UPDATE 5

Looks like user TomTom deleted their answer with the Mersenne Twister without notice after i pointed out in a comment that i found out that it generates duplicate numbers sooner than required. So i guess this rules out the Mersenne Twister entirely.

UPDATE 6

I tested another suggested technique that looks promising, Skip32, and while i really liked the quality of the random numbers, the algorithm is not suitable for generating the entire range of integers in an acceptable amount of time. So unfortunately it falls short when compared to the other techniques that were able to finish the process. I used the implementation in C# from here, by the way - i changed the code to reduce the number of rounds to 1 but it still can't finish in a timely manner.

After all, judging by the results described above, my personal choice for the solution goes to the modular multiplicative inverses technique, followed very closely by the linear congruential generator. Some may argue that this is inferior in certain aspects to other techniques, but given my original constraints i'd say it fits them best.

Snowplow answered 30/1, 2016 at 12:59 Comment(19)
I think you pretty much have an answer right in your question, you're just not seeing it. Maintaining a set-like collection is possible with a bit array. It requires 2**32 bits, which is 512 MB, and seems on the high end but still within your requirement of a few hundred mega bytes. Generating a random number that isn't in that set doesn't need to be trial and error: if you have N numbers remaining, generate a random number from 0 to N-1, and then pick the N-th remaining number.Afteryears
@hvd: I was thinking at some point at bit arrays as well, i knew it would take less space than integer arrays/collections, but it didn't occur to me how i could populate that array, thanks for the insight! As for that 512 MB lower bound, i'd say it's still questionable - think of older PCs, such as ones in the '90s era, when such an amount of memory was a luxury.Snowplow
@Bjørn-RogerKringsjå: yes, all of them, in the entire range.Snowplow
This seems to be a special case of this question: #10055232Higdon
@usr: You're right, it's basically the same question but asked in a more mathematical-oriented way; have you eventually managed to sort through the details and implement it? I assume you tried the LFSR technique.Snowplow
I didn't implement any of that. If I remember correctly I went with the fisher yates shuffle and accepted the wasteful memory usage because this was not production code.Higdon
Does it have to be random each time, or is the same result which satisfies your conditions each time permitted?Pol
@WaiHaLee: We're not talking about a security/crypto situation here, so I'd say the same randomised result is permitted each time if it better satisfies the memory requirements. But ideally it should give a new random sequence each time.Snowplow
@GabrielS. - if you're not talking security/crypto, would you permit being able to deterministically calculate one number from the previous values?Pol
@WaiHaLee: I'd say yes, as long as the end result is not an obviously increasing or decreasing sequence of numbers (see my remark about the shift register).Snowplow
What does RANDOM mean? Random to which standard? Pseudeo-Random - implement or download a mersenne-twister. Good enough for 99.9% of the applications.Slily
@TomTom: Pseudo-Random is just fine; i just don't want a blatantly obvious relationship between numbers, no ordered sequences, etc.Snowplow
@GabrielS. See my answer - a mersenne twister works perfect as long as you do not ask for too many numbers (for 8 bit it will repeat after 255 or 256 numbers - because it HAS to). It is small and efficient.Slily
@GabrielS. - one last thing - would you permit repetition (of the whole sequence) once all the numbers are exhausted?Pol
@WaiHaLee: maybe I don't understand your question correctly, but wouldn't that mean that the generated random numbers are repeated? Thus violating my non-repeating requirement.Snowplow
@GabrielS. - what I mean is: there are 2^32 ~= 4x10^9 distinct values. Once all those have been returned, you must, by definition, start returning duplicates. Are you happy for the sequence to repeat with a period of 2^32?Pol
@WaiHaLee: Yes, of course, I thought first you're talking about a much lower period and the duplicates start showing sooner than they should.Snowplow
Related question: How to obfuscate an integer?Alexine
"it would get ridiculously slow when generating the last numbers in the sequence" while this is true, the total runtime is still only n log(n). Keeping track of duplicates using a bitfield would take 2^32 bits = 512 MiB, so it might not fail your memory requirements.Alexine
P
8

Is there a way in .NET

Actually, this can be done in most any language

to generate a sequence of all the 32-bit integers (Int32)

Yes.

in random order,

Here we need to agree on terminology since "random" is not what most people think it is. More on this in a moment.

without repetitions,

Yes.

and in a memory-efficient manner?

Yes.

Memory-efficient would mean using a maximum of just a few hundred mega bytes of main memory.

Ok, so would using almost no memory be acceptable? ;-)

Before getting to the suggestion, we need to clear up the matter of "randomness". Something that is truly random has no discernible pattern. Hence, running the algorithm millions of times in a row could theoretically return the same value across all iterations. If you throw in the concept of "must be different from the prior iteration", then it is no longer random. However, looking at all of the requirements together, it seems that all that is really being asked for is "differing patterns of distribution of the integers". And this is doable.

So how to do this efficiently? Make use of Modular multiplicative inverses. I used this to answer the following Question which had a similar requirement to generate non-repeating, pseudo-random, sample data within certain bounds:

Generate different random time in the given interval

I first learned about this concept here ( generate seemingly random unique numeric ID in SQL Server ) and you can use either of the following online calculators to determine your "Integer" and "Modular Multiplicative Inverses (MMI)" values:

Applying that concept here, you would use Int32.MaxSize as the Modulo value.

This would give a definite appearance of random distribution with no chance for collisions and no memory needed to store already used values.

The only initial problem is that the pattern of distribution is always the same given the same "Integer" and "MMI" values. So, you could come up with differing patterns by either adding a "randomly" generated Int to the starting value (as I believe I did in my answer about generating the sample data in SQL Server) or you can pre-generate several combinations of "Integer" and corresponding "MMI" values, store those in a config file / dictionary, and use a .NET random function to pick one at the start of each run. Even if you store 100 combinations, that is almost no memory use (assuming it is not in a config file). In fact, if storing both as Int and the dictionary uses Int as an index, then 1000 values is approximately 12k?


UPDATE

Notes:

  • There is a pattern in the results, but it is not discernible unless you have enough of them at any given moment to look at in total. For most use-cases, this is acceptable since no recipient of the values would have a large collection of them, or know that they were assigned in sequence without any gaps (and that knowledge is required in order to determine if there is a pattern).
  • Only 1 of the two variable values -- "Integer" and "Modular Multiplicative Inverse (MMI)" -- is needed in the formula for a particular run. Hence:
    • each pair gives two distinct sequences
    • if maintaining a set in memory, only a simple array is needed, and assuming that the array index is merely an offset in memory from the base address of the array, then the memory required should only be 4 bytes * capacity (i.e. 1024 options is only 4k, right?)

Here is some test code. It is written in T-SQL for Microsoft SQL Server since that is where I work primarily, and it also has the advantage of making it real easy-like to test for uniqueness, min and max values, etc, without needing to compile anything. The syntax will work in SQL Server 2008 or newer. For SQL Server 2005, initialization of variables had not been introduced yet so each DECLARE that contains an = would merely need to be separated into the DECLARE by itself and a SET @Variable = ... for however that variable is being initialized. And the SET @Index += 1; would need to become SET @Index = @Index + 1;.

The test code will error if you supply values that produce any duplicates. And the final query indicates if there are any gaps since it can be inferred that if the table variable population did not error (hence no duplicates), and the total number of values is the expected number, then there could only be gaps (i.e. missing values) IF either or both of the actual MIN and MAX values are outside of the expected values.

PLEASE NOTE that this test code does not imply that any of the values are pre-generated or need to be stored. The code only stores the values in order to test for uniqueness and min / max values. In practice, all that is needed is the simple formula, and all that is needed to pass into it is:

  • the capacity (though that could also be hard-coded in this case)
  • the MMI / Integer value
  • the current "index"

So you only need to maintain 2 - 3 simple values.

DECLARE @TotalCapacity INT = 30; -- Modulo; -5 to +4 = 10 OR Int32.MinValue
                                 -- to Int32.MaxValue = (UInt32.MaxValue + 1)
DECLARE @MMI INT = 7; -- Modular Multiplicative Inverse (MMI) or
                      -- Integer (derived from @TotalCapacity)

DECLARE @Offset INT = 0; -- needs to stay at 0 if min and max values are hard-set
-----------
DECLARE @Index INT = (1 + @Offset); -- start

DECLARE @EnsureUnique TABLE ([OrderNum] INT NOT NULL IDENTITY(1, 1),
                             [Value] INT NOT NULL UNIQUE);
SET NOCOUNT ON;

BEGIN TRY
    WHILE (@Index < (@TotalCapacity + 1 + @Offset)) -- range + 1
    BEGIN
        INSERT INTO @EnsureUnique ([Value]) VALUES (
                 ((@Index * @MMI) % @TotalCapacity) - (@TotalCapacity / 2) + @Offset
                                                   );
        SET @Index += 1;
    END;
END TRY
BEGIN CATCH
    DECLARE @Error NVARCHAR(4000) = ERROR_MESSAGE();
    RAISERROR(@Error, 16, 1);
    RETURN;
END CATCH;

SELECT * FROM @EnsureUnique ORDER BY [OrderNum] ASC;
SELECT COUNT(*) AS [TotalValues],
       @TotalCapacity AS [ExpectedCapacity],
       MIN([Value]) AS [MinValue],
       (@TotalCapacity / -2) AS [ExpectedMinValue],
       MAX([Value]) AS [MaxValue],
       (@TotalCapacity / 2) - 1 AS [ExpectedMaxValue]
FROM   @EnsureUnique;
Picoline answered 30/1, 2016 at 16:9 Comment(5)
I may be wrong but this looks like the most mathematically straightforward technique described in this thread so far. I'll give it a try and compare to the others.Snowplow
@GabrielS. FYI, I just added an UPDATE section with some example code, though it is written in T-SQL ;-)Picoline
Thanks, this is going to be helpful!Snowplow
@GabrielS. No problem. I also just added a clarification paragraph just above the example code to make it clear that it tests for both "no gaps" and "no duplicates".Picoline
See my updated original post. I managed to implement this for my required range and the results seem pretty good, comparable to the LCG technique described in another answer. Definitely a handy technique for scenarios such as this. I had to tweak the numbers a bit though to get a random-enough sequence.Snowplow
C
14

If you don't need the random numbers to be cryptographically secure, you can use a Linear Congruential Generator.

An LCG is a formula of the form X_n+1 = X_n * a + c (mod m), it needs constant memory and constant time for every generated number.
If proper values for the LCG are chosen, it will have a full period length, meaning it will output every number between 0 and your chosen modulus.

An LCG has a full period if and only if:

  • The modulus and the increment are relatively prime, i.e. GCD(m, c) = 1
  • a - 1 is divisible by all prime factors of m
  • If m is divisible by 4, a - 1 must be divisible by 4.

Our modulus is 2 ^ 32, meaning a must be a number of form 4k + 1 where k is an arbitrary integer, and c must not be divisible by 2.

While this is a C# question I've coded a small C++ program to test the speed of this solution, since I'm more comfortable in that language:

#include <iostream>
#include <stdlib.h>

class lcg {
private:
    unsigned a, c, val;
public:
    lcg(unsigned seed=0) : lcg(seed, rand() * 4 + 1, rand() * 2 + 1) {}
    lcg(unsigned seed, unsigned a, unsigned c) {
        val = seed;
        this->a = a;
        this->c = c;
        std::cout << "Initiated LCG with seed " << seed << "; a = " << a << "; c = " << c << std::endl;
    }

    unsigned next() {
        this->val = a * this->val + c;
        return this->val;
    }
};

int main() {
    srand(time(NULL));
    unsigned seed = rand();
    int dummy = 0;
    lcg gen(seed);
    time_t t = time(NULL);
    for (uint64_t i = 0; i < 0x100000000ULL; i++) {
        if (gen.next() < 1000) dummy++; // Avoid optimizing this out with -O2
    }
    std::cout << "Finished cycling through. Took " << (time(NULL) - t) << " seconds." << std::endl;
    if (dummy > 0) return 0;
    return 1;
}

You may notice I am not using the modulus operation anywhere in the lcg class, that's because we use 32 bit integer overflow for our modulus operation.
This produces all values in the range [0, 4294967295] inclusive.
I also had to add a dummy variable for the compiler not to optimize everything out.
With no optimization this solution finishes in about 15 seconds, while with -O2, a moderate optimization it finishes under 5 seconds.

If "true" randomness is not an issue, this is a very fast solution.

Complement answered 30/1, 2016 at 16:27 Comment(10)
lcg(seed, a, c); creates a temporary lcg object, it does not modify the current instance. This breaks it quite badly: just try printing the first 20 generated numbers or so. Rewriting the first constructor lcg(int seed=0) : lcg(seed, rand() * 4 + 1, rand() * 2 + 1) { } makes it work as intended. Also, you should probably work with unsigned int, signed overflow is undefined in C++ and compilers optimise quite aggressively nowadays. With those changes, it works as intended, although I'm not qualified to comment on the quality of the generated numbers.Afteryears
@hvd Thank you for pointing these issues out. I've corrected my answer.Complement
This looks promising as well, I'll try to compare its performance with other acceptable techniques proposed here and then try to judge which is the most appropriate.Snowplow
I spot an in hindsight very obvious non-randomness in the results: it alternates between even and odd numbers. This is not necessarily a problem, it depends on how the numbers will be used.Afteryears
@hvd Yep, that's a flaw of the LCG method, quoting Wikipedia "A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2." You could use an LCG of period 2^32+n and drop the highest n results to preserve uniqueness. An example of such number with a nice factorization is 2^32+4 = 2^2 · 5^2 · 13 · 41 · 61 · 1321 since it has a bunch of small factors.Complement
While lcgs are rather flawed (particularly the least significant digits), picking good numbers for a and c makes a great difference. Don Knuth has a bucket list of what should be fulfilled and picking a random a is unlikely to fulfill them. Meaning, pick some that have already been shown to be reasonably good.Renae
It should make the numbers appear more random by rotating the integer halfway and xoring it with the original. this->val = a * this->val + c; unsigned rot = this->val << 16 & this->val >> 16; this->val ^= rot; return this->val;Mozambique
@Mozambique this->val << 16 & this->val >> 16 is just 0. If you meant | rather than &, then you no longer get each number once and only once.Afteryears
@hvd Good points... I checked whether that operation was really a 1:1 function over 0<=x<2^32 by xoring zero over the range and checking that there were still 1000 results < 1000, d'oh.Mozambique
I just tested the LCG with a and c set to the values listed at "Numerical Recipes" in the Wikipedia article and i'm very pleased with the results! See my updated original post.Snowplow
P
8

Is there a way in .NET

Actually, this can be done in most any language

to generate a sequence of all the 32-bit integers (Int32)

Yes.

in random order,

Here we need to agree on terminology since "random" is not what most people think it is. More on this in a moment.

without repetitions,

Yes.

and in a memory-efficient manner?

Yes.

Memory-efficient would mean using a maximum of just a few hundred mega bytes of main memory.

Ok, so would using almost no memory be acceptable? ;-)

Before getting to the suggestion, we need to clear up the matter of "randomness". Something that is truly random has no discernible pattern. Hence, running the algorithm millions of times in a row could theoretically return the same value across all iterations. If you throw in the concept of "must be different from the prior iteration", then it is no longer random. However, looking at all of the requirements together, it seems that all that is really being asked for is "differing patterns of distribution of the integers". And this is doable.

So how to do this efficiently? Make use of Modular multiplicative inverses. I used this to answer the following Question which had a similar requirement to generate non-repeating, pseudo-random, sample data within certain bounds:

Generate different random time in the given interval

I first learned about this concept here ( generate seemingly random unique numeric ID in SQL Server ) and you can use either of the following online calculators to determine your "Integer" and "Modular Multiplicative Inverses (MMI)" values:

Applying that concept here, you would use Int32.MaxSize as the Modulo value.

This would give a definite appearance of random distribution with no chance for collisions and no memory needed to store already used values.

The only initial problem is that the pattern of distribution is always the same given the same "Integer" and "MMI" values. So, you could come up with differing patterns by either adding a "randomly" generated Int to the starting value (as I believe I did in my answer about generating the sample data in SQL Server) or you can pre-generate several combinations of "Integer" and corresponding "MMI" values, store those in a config file / dictionary, and use a .NET random function to pick one at the start of each run. Even if you store 100 combinations, that is almost no memory use (assuming it is not in a config file). In fact, if storing both as Int and the dictionary uses Int as an index, then 1000 values is approximately 12k?


UPDATE

Notes:

  • There is a pattern in the results, but it is not discernible unless you have enough of them at any given moment to look at in total. For most use-cases, this is acceptable since no recipient of the values would have a large collection of them, or know that they were assigned in sequence without any gaps (and that knowledge is required in order to determine if there is a pattern).
  • Only 1 of the two variable values -- "Integer" and "Modular Multiplicative Inverse (MMI)" -- is needed in the formula for a particular run. Hence:
    • each pair gives two distinct sequences
    • if maintaining a set in memory, only a simple array is needed, and assuming that the array index is merely an offset in memory from the base address of the array, then the memory required should only be 4 bytes * capacity (i.e. 1024 options is only 4k, right?)

Here is some test code. It is written in T-SQL for Microsoft SQL Server since that is where I work primarily, and it also has the advantage of making it real easy-like to test for uniqueness, min and max values, etc, without needing to compile anything. The syntax will work in SQL Server 2008 or newer. For SQL Server 2005, initialization of variables had not been introduced yet so each DECLARE that contains an = would merely need to be separated into the DECLARE by itself and a SET @Variable = ... for however that variable is being initialized. And the SET @Index += 1; would need to become SET @Index = @Index + 1;.

The test code will error if you supply values that produce any duplicates. And the final query indicates if there are any gaps since it can be inferred that if the table variable population did not error (hence no duplicates), and the total number of values is the expected number, then there could only be gaps (i.e. missing values) IF either or both of the actual MIN and MAX values are outside of the expected values.

PLEASE NOTE that this test code does not imply that any of the values are pre-generated or need to be stored. The code only stores the values in order to test for uniqueness and min / max values. In practice, all that is needed is the simple formula, and all that is needed to pass into it is:

  • the capacity (though that could also be hard-coded in this case)
  • the MMI / Integer value
  • the current "index"

So you only need to maintain 2 - 3 simple values.

DECLARE @TotalCapacity INT = 30; -- Modulo; -5 to +4 = 10 OR Int32.MinValue
                                 -- to Int32.MaxValue = (UInt32.MaxValue + 1)
DECLARE @MMI INT = 7; -- Modular Multiplicative Inverse (MMI) or
                      -- Integer (derived from @TotalCapacity)

DECLARE @Offset INT = 0; -- needs to stay at 0 if min and max values are hard-set
-----------
DECLARE @Index INT = (1 + @Offset); -- start

DECLARE @EnsureUnique TABLE ([OrderNum] INT NOT NULL IDENTITY(1, 1),
                             [Value] INT NOT NULL UNIQUE);
SET NOCOUNT ON;

BEGIN TRY
    WHILE (@Index < (@TotalCapacity + 1 + @Offset)) -- range + 1
    BEGIN
        INSERT INTO @EnsureUnique ([Value]) VALUES (
                 ((@Index * @MMI) % @TotalCapacity) - (@TotalCapacity / 2) + @Offset
                                                   );
        SET @Index += 1;
    END;
END TRY
BEGIN CATCH
    DECLARE @Error NVARCHAR(4000) = ERROR_MESSAGE();
    RAISERROR(@Error, 16, 1);
    RETURN;
END CATCH;

SELECT * FROM @EnsureUnique ORDER BY [OrderNum] ASC;
SELECT COUNT(*) AS [TotalValues],
       @TotalCapacity AS [ExpectedCapacity],
       MIN([Value]) AS [MinValue],
       (@TotalCapacity / -2) AS [ExpectedMinValue],
       MAX([Value]) AS [MaxValue],
       (@TotalCapacity / 2) - 1 AS [ExpectedMaxValue]
FROM   @EnsureUnique;
Picoline answered 30/1, 2016 at 16:9 Comment(5)
I may be wrong but this looks like the most mathematically straightforward technique described in this thread so far. I'll give it a try and compare to the others.Snowplow
@GabrielS. FYI, I just added an UPDATE section with some example code, though it is written in T-SQL ;-)Picoline
Thanks, this is going to be helpful!Snowplow
@GabrielS. No problem. I also just added a clarification paragraph just above the example code to make it clear that it tests for both "no gaps" and "no duplicates".Picoline
See my updated original post. I managed to implement this for my required range and the results seem pretty good, comparable to the LCG technique described in another answer. Definitely a handy technique for scenarios such as this. I had to tweak the numbers a bit though to get a random-enough sequence.Snowplow
A
3

A 32 bit PRP in CTR mode seems like the only viable approach to me (your 4th variant).

You can either

  • Use a dedicated 32 bit block cipher.

    Skip32, the 32 bit variant of Skipjack is a popular choice.

    As a tradeoff between quality/security and performance you can adjust the number of rounds to your needs. More rounds are slower but more secure.

  • Length-preserving-encryption (a special case of format-preserving-encryption)

    FFX mode is the typical recommendation. But in its typical instantiations (e.g. using AES as underlying cipher) it'll be much slower than dedicated 32 bit block ciphers.

Note that many of these constructions have a significant flaw: They're even permutations. That means that once you have seen 2^32-2 outputs, you'll be able to predict the second-to-last output with certainty, instead of only 50%. I think Rogaways AEZ paper mentions a way to fix this flaw.

Alexine answered 31/1, 2016 at 10:22 Comment(3)
It's not clear to me how i should use Skip32 for my purpose, could you please give me some hints? Should i iterate across the entire range of integers and encrypt them as they come, thus resulting the random sequence?Snowplow
@GabrielS. Use the seed as key. Then encrypt a counter. Unique inputs produce unique outputs and a counter is obviously unique.Alexine
See my updated original post. The generated numbers look really good when it comes to randomness, however it takes over 9 minutes to generate all the integers, even with a single round.Snowplow
S
2

I'm going to preface this answer by saying I realize that some of the other answers are infinitely more elegant, and probably fit your needs better than this one. This is most certainly a brute-force approach to this problem.

If getting something truly random* (or pseudo-random* enough for cryptographic purposes) is important, you could generate a list of all integers ahead of time, and store them all on disk in random order ahead of time. At the run time of your program, you then read those numbers from the disk.

Below is the basic outline of the algorithm I'm proposing to generate these numbers. All 32-bit integers can be stored in ~16 GiB of disk space (32 bits = 4 bytes, 4 bytes / integer * 2^32 integers = 2^34 bytes = 16 GiB, plus whatever overhead the OS/filesystem needs), and I've taken "a few hundred megabytes" to mean that you want to read in a file of no more than 256 MiB at a time.

  1. Generate 16 GiB / 256 MiB = 64 ASCII text files with 256 MiB of "null" characters (all bits set to 0) each. Name each text file "0.txt" through "64.txt"
  2. Loop sequentially from Int32.MinValue to Int32.MaxValue, skipping 0. This is the value of the integer you're currently storing.
  3. On each iteration, generate a random integer from 0 to UInt32.MaxValue from the source of randomness of your choice (hardware true random generator, pseudo-random algorithm, whatever). This is the index of the value you're currently storing.
  4. Split the index into two integers: the 6 most significant bits, and the remaining 26. Use the upper bits to load the corresponding text file.
  5. Multiply the lower 26 bits by 4 and use that as an index in the opened file. If the four bytes following that index are all still the "null" character, encode the current value into four ASCII characters, and store those characters in that position. If they are not all the "null" character, go back to step 3.
  6. Repeat until all integers have been stored.

This would ensure that the numbers are from a known source of randomness but are still unique, rather than having the limitations of some of the other proposed solutions. It would take a long time to "compile" (especially using the relatively naive algorithm above), but it meets the runtime efficiency requirements.

At runtime, you can now generate a random starting index, then read the bytes in the files sequentially to obtain a unique, random*, non-repeating sequence of integers. Assuming that you're using a relatively small number of integers at once, you could even index randomly into the files, storing which indices you've used and ensuring a number is not repeated that way.

(* I understand that the randomness of any source is lessened by imposing the "uniqueness" constraint, but this approach should produce numbers relatively close in randomness to the original source)

TL;DR - Shuffle the integers ahead of time, store all of them on disk in a number of smaller files, then read from the files as needed at runtime.

Soluk answered 31/1, 2016 at 2:44 Comment(1)
While disk space sounds like a good "alternative" to main memory, I believe using 16 GB of it for this purpose seems a bit exaggerated. Other than this, though, it fits the requirements, it's not a bad solution if you're willing to trade any other resource for main memory, no matter how much of it.Snowplow
F
1

As your numbers per your definition are supposed to be random then there is by definition no other way than to store all of then as the number have no intrinsic relation to each other. So this would mean that you have to store all values you used in order to prevent them from being used again.

However, in computing the pattern just needs to be not "noticable". Usually the system calculates a random number by performing multiplication operations with huge predetermined values and timer values in such a way that they overflow and thus appear randomly selected. So either you use your third option or you have to think about generating these pseudo random numbers in a way that you can reproduce the sequence of every number generated and check if something in reoccuring. This obviously would be extremely computationally expensive but you asked for memory efficiency.

So you could store the number you seeded you random generator with and the number of elements you generated. Each time you need a new number, reseed the generator and iterate through the number of elements you generated + 1. This is your new number. Now reseed and iterate through the sequence again to check if it occured before.

So something like this:

int seed = 123;
Int64 counter = 0;
Random rnd = new Random(seed);

int GetUniqueRandom()
{
    int newNumber = rnd.Next();
    Random rndCheck = new Random(seed);

    counter++;

    for (int j = 0; j < counter; j++)
    {
        int checkNumber = rndCheck.Next();

        if (checkNumber == newNumber)
            return GetUniqueRandom();
    }

    return newNumber;        
}

EDIT: It was pointed out that counter will reach a huge value and there's no telling if it will overflow before you got all of the 4 billion values or not.

Florri answered 30/1, 2016 at 13:20 Comment(7)
Given that one of the OP's objections to one of the ideas in the question was that "it would get ridiculously slow", I would think focusing exclusively on memory efficiency won't be good enough, but I do like the creativity in your approach.Afteryears
I think I see another problem: your counter may overflow since you may need (and almost certainly will need) more than int.MaxValue rnd.Next() calls until you will have obtained all possible results.Afteryears
Also, rnd.Next() never returns negative numbers or int.MaxValue, but that should be easy enough to fix by calling rnd.Next(65536) twice and combining the results.Afteryears
Of course, i'm not looking for absolute randomness, pseudo-randomness is fine. I like your approach, i can't see any flaws in it at the moment (assuming hvd's corrections), although it's a bit extreme when taking into account computational time. However it definitely seems to meet the memory requirements, i can't see a way to go significantly lower than this.Snowplow
-1 for the obvious wrong statement of ""i computing there is no such thing as real randomness.". Hardware random number generation is truly random and implemented in quite a lot pf processors these days.Slily
I really don't see how this is going to work, even if you would use a random number generator instead of Random... You're basically calculating collisions recursively -- which means a SO, a so much overhead in the calculations that it comes to a grinding halt (If you're over 50% of the collection, you get a million numbers to check times 2000 for each new number you generate?!?) -- see my first algorithm in my answer: you already get 13 collisions in that case). The last number will probably never be calculated in our lifetimes...Ambert
The recursion does not need to be a problem. It's tail recursion, which is easily rewritten using a loop. (The other problems still apply, of course.)Afteryears
A
1

Nice puzzle. A few things come to mind:

  • We need to store which items have been used. If approximately is good enough, you might want to use a bloom filter for this. But since you specifically state that you want all numbers, there's only one data structure for this: a bit vector.
  • You probably want to use a pseudo random generator algorithm with a long period.
  • And the solution probably involves using multiple algorithm.

My first attempt was to figure out how good pseudo random number generation works with a simple bit vector. I accept collisions (and therefore a slowdown), but definitely not too many collisions. This simple algorithm will generate about half the numbers for you in a limited amount of time.

static ulong xorshift64star(ulong x)
{
    x ^= x >> 12; // a
    x ^= x << 25; // b
    x ^= x >> 27; // c

    return x * 2685821657736338717ul;
}

static void Main(string[] args)
{
    byte[] buf = new byte[512 * 1024 * 1024];
    Random rnd = new Random();

    ulong value = (uint)rnd.Next(int.MinValue, int.MaxValue);
    long collisions = 0;

    Stopwatch sw = Stopwatch.StartNew();

    for (long i = 0; i < uint.MaxValue; ++i)
    {
        if ((i % 1000000) == 0)
        {
            Console.WriteLine("{0} random in {1:0.00}s (c={2})", i, sw.Elapsed.TotalSeconds, collisions - 1000000);
            collisions = 0;
        }

        uint randomValue; // result will be stored here
        bool collision;

        do
        {
            value = xorshift64star(value);
            randomValue = (uint)value;

            collision = (buf[randomValue >> 4] & (1 << (int)(randomValue & 7))) != 0;
            ++collisions;
        }
        while (collision);

        buf[randomValue >> 4] |= (byte)(1 << (int)(randomValue & 7));
    }

    Console.ReadLine();
}

After about 1,9 billion random numbers, the algorithm will start to come to a grinding halt.

1953000000 random in 283.74s (c=10005932) [...] 2108000000 random in 430.66s (c=52837678)

So, let's for the sake of argument say that you're going to use this algorithm for the first +/- 2 billion numbers.

Next, you need a solution for the rest, which is basically the problem that the OP described. For that, I'd sample random numbers into a buffer and combine the buffer with the Knuth shuffle algorithm. You can also use this right from the start if you like.

This is what I came up with (probably still buggy so do test...):

static void Main(string[] args)
{
    Random rnd = new Random();

    byte[] bloom = new byte[512 * 1024 * 1024];
    uint[] randomBuffer = new uint[1024 * 1024];

    ulong value = (uint)rnd.Next(int.MinValue, int.MaxValue);
    long collisions = 0;

    Stopwatch sw = Stopwatch.StartNew();
    int n = 0;

    for (long i = 0; i < uint.MaxValue; i += n)
    {
        // Rebuild the buffer. We know that we have uint.MaxValue-i entries left and that we have a 
        // buffer of 1M size. Let's calculate the chance that you want any available number in your 
        // buffer, which is now:

        double total = uint.MaxValue - i;
        double prob = ((double)randomBuffer.Length) / total;

        if (i >= uint.MaxValue - randomBuffer.Length)
        {
            prob = 1; // always a match.
        }

        uint threshold = (uint)(prob * uint.MaxValue);
        n = 0;

        for (long j = 0; j < uint.MaxValue && n < randomBuffer.Length; ++j)
        {
            // is it available? Let's shift so we get '0' (unavailable) or '1' (available)
            int available = 1 ^ ((bloom[j >> 4] >> (int)(j & 7)) & 1);

            // use the xorshift algorithm to generate a random value:
            value = xorshift64star(value);

            // roll a die for this number. If we match the probability check, add it.
            if (((uint)value) <= threshold * available)
            {
                // Store this in the buffer
                randomBuffer[n++] = (uint)j;

                // Ensure we don't encounter this thing again in the future
                bloom[j >> 4] |= (byte)(1 << (int)(j & 7));
            }
        }

        // Our buffer now has N random values, ready to be emitted. However, it's 
        // still sorted, which is something we don't want. 
        for (int j = 0; j < n; ++j)
        {
            // Grab index to swap. We can do this with Xorshift, but I didn't bother.
            int index = rnd.Next(j, n);

            // Swap
            var tmp = randomBuffer[j];
            randomBuffer[j] = randomBuffer[index];
            randomBuffer[index] = tmp;
        }

        for (int j = 0; j < n; ++j)
        {
            uint randomNumber = randomBuffer[j];
            // Do something with random number buffer[i]
        }

        Console.WriteLine("{0} random in {1:0.00}s", i, sw.Elapsed.TotalSeconds);
    }

    Console.ReadLine();
}

Back to the requirements:

Is there a way in .NET to generate a sequence of all the 32-bit integers (Int32) in random order, without repetitions, and in a memory-efficient manner? Memory-efficient would mean using a maximum of just a few hundred mega bytes of main memory.

Cost: 512 MB + 4 MB. Repetitions: none.

It's pretty fast. It just isn't 'uniformly' fast. Every 1 million numbers, you have to recalculate the buffer.

What's also nice: both algorithms can work together, so you can first generate the first -say- 2 billion numbers very fast, and then use the second algorithm for the rest.

Ambert answered 30/1, 2016 at 14:9 Comment(6)
Just tested, on my laptop it's approximately 1M random numbers in 20 seconds. A possible optimization is in available and the multiplication; you can also force it to 0 and -1 and do an and instead of a multiplication.Ambert
So if the final memory consumption is around 512MB, isn't the solution proposed by hvd in the comment above more straightforward and maybe even faster? Or maybe I miss something?Snowplow
Yes, it's total nonsense. :) Hvb suggest picking a random number out of the set of remaining numbers - which implies you have to store them all. Hence you need 4G for that -- and otherwise you'll end up with collisions (which is the first algorithm).Ambert
@Slily Really? My solution does work and fulfills all the requirements; I acknowledge that your Twister (that I didn't see before today) is probably a much better solution, but there's no need to be rude about it.Ambert
@Slily I am fairly certain that it would be considered acceptable to yell at your monitor that someone is an idiot, including all of the intricate details as to why, and then adding a comment like: "Pseudo-Random-Number-Generation is not exactly new, and some mathematicians have made some good solutions here that do not require remembering all numbers. This book, _____, for example, is a good place to start.". Just a thought :-) Or, at the very least, please use spell-check prior adding comments in which you call others idiots ;-)Picoline
@Ambert Not bothering to get a name of three letters right? Calling my comment total nonsense? Yet still calling out others on being rude? Hypocrite. No, nothing about my comment would require 4GB of memory. Given a 512MB bit array, it's certainly possible to find the N-th zero bit for any value of N, without any additional memory requirements.Afteryears
C
1

One of the easiest solutions is to use an block encrytion algorithm like AES in countermode. You need a seed which equals the key in AES. Next you need a counter which is incremented for each new random value. The random value is the result of encrypting the counter with the key. Since the cleartext (counter) and the random number (ciphertext) is bijectiv and because of the pigeon hole principle the random numbers are unique (for the blocksize).

Memory efficiency: you only need to store the seed and the counter.

The only limmitation is that AES has 128 bit block size instead of your 32 bit. So you might need to increase to 128 bit or find a block cipher with 32 bit block size.

For your IEnumerable you can write a wrapper. The index is the counter.

Disclaimer: You are asking for non-repeating/unique: This disqualifies from random because normally you should see collisions in random numbers. Therefore you should not use it for a long sequence. See also https://crypto.stackexchange.com/questions/25759/how-can-a-block-cipher-in-counter-mode-be-a-reasonable-prng-when-its-a-prp

Codling answered 30/1, 2016 at 17:2 Comment(2)
I suspected that cryptography might have some algorithms that achieve something similar to what I'm after, however the requirement of uniqueness is fixed. Think of it as randomly shuffling a sequence of all the integers, no more, no less. As per your disclaimer, I assume AES is not the best fit here.Snowplow
Then the only problem with this is that AES has a block size of 128 bit instead of 32 bit. But as said any other block cipher will do it too. The only difference between thoose will be how secure and how biased the random numbers will be. I don't know of any non-broken 32 bit block cipher. But since you already have violations like in bias/uniqueness and stated you don't need high security this answer fits your question. Someone already mentioned Skipjack32 (I never used this one before) The disclaimer was mostly for other people to consider in case they have other requirements.Codling
A
0

You could try this homebrew block-cipher:

public static uint Random(uint[] seed, uint m)
{   
    for(int i = 0; i < seed.Length; i++)
    {
        m *= 0x6a09e667;
        m ^= seed[i];
        m += m << 16;
        m ^= m >> 16;
    }
    return m;
}

Test code:

const int seedSize = 3; // larger values result in higher quality but are slower
var seed = new uint[seedSize];
var seedBytes = new byte[4 * seed.Length];
new RNGCryptoServiceProvider().GetBytes(seedBytes);
Buffer.BlockCopy(seedBytes, 0, seed, 0, seedBytes.Length);  

for(uint i = 0; i < uint.MaxValue; i++)
{
    Random(seed, i);
}

I haven't checked the quality of its outputs yet. Runs in 19 sec on my computer for seedSize = 3.

Alexine answered 11/2, 2016 at 19:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.