Fastest way to bring a range [from, to] of 64-bit integers into pseudo-random order, with same results across all platforms?
Given some interval [a, b] of indices (64-bit unsigned integers), I would like to quickly obtain an array that contains all of these indices ordered according to a uniformly distributed hash function, appearing random but actually being the same on every system, regardless of the C++ implementation used.

The goal is to find highly optimized methods of this kind. You may use shared memory parallelism via Intel's oneTBB to improve performance.

Something like

vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    unordered_set<uint64_t> uset;
    for (uint64_t i = from; i <= to; i++)
        uset.insert(i);
    return vector<uint64_t>(uset.begin(), uset.end());
}

would generate the desired result if unordered_set<uint64_t> always used the same, pseudo-randomly distributed hash function on every implementation, neither of which is the case. It would also be an inefficient solution. TBB equivalent:

tbb::concurrent_vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    tbb::concurrent_unordered_set<uint64_t> uset;
    tbb::parallel_for(from, to + 1, [&uset](uint64_t i) {
        uset.insert(i);
    }); // NOTE: This is only to illustrate a parallel loop, sequential insertion is actually faster.
    return tbb::concurrent_vector<uint64_t>(uset.begin(), uset.end());
}

Note that distributeIndices(from, to) should return a random-looking permutation of {from, ..., to}.

  • Merely providing some hash functions is insufficient, and none of the answers at "Generating a deterministic int from another with no duplicates" actually answered that question.
    Consider transform from this answer. Notably, a cyclic distribution is not a pseudo-random distribution:

    1. Sorting {from, ..., to} w.r.t. (uint64_t a, uint64_t b) { return transform(a) < transform(b) }
      • distributeIndices(42, 42+99999999)[0, ..., 999] looks not random at all:
        stackoverflow.com/a/72188615 sorted
    2. Sorting {from, ..., to} w.r.t. (uint64_t a, uint64_t b) { return transform(a) % n < transform(b) % n }
      • distributeIndices(42, 42+99999999)[0, ..., 999] looks not random at all:
        stackoverflow.com/a/72188615 sorted modulo n
    3. Assigning each x in {from, ..., to} to transform(x - from) % n + from
      • distributeIndices(42, 42+99999999) happens to be bijective (since 100000000 and 39293 are coprime), but distributeIndices(42, 42+99999999)[0, ..., 999] looks not random at all: stackoverflow.com/a/72188615 assigned modulo n
      • distributeIndices(42, 42+3929299) is not bijective. It assigns only 100 different elements, cycling with a period of 100:
        stackoverflow.com/a/72188615 assigned modulo n
    4. Assigning each x in {from, ..., to} to transform(x - from) + from
      • distributeIndices(42, 42+99999999) is not bijective, e.g. it assigns 3929375282657 > 42+99999999.
  • In particular, a linear congruential generator is not in general a bijection. But if you can make it such for each interval [from, to], while also hiding its cyclic nature, how?

Consequently, an answer should provide a specific hash function (and why it is fast and uniformly distributed), and how to efficiently utilize it in order to compute distributeIndices(from, to).

Again, it is crucial that distributeIndices(from, to) has the same result regardless of where it runs and which compiler was used, and this must be guaranteed by the C++ standard. But it is fine if, for example, distributeIndices(0,2) assigns a different index to 1 than distributeIndices(0,3) does.

Acceptable return types are std::vector, tbb::concurrent_vector, and dynamic arrays, consisting of elements of type uint64_t.
The function should perform well on ranges that include billions of indices.

[ In case you are curious why this could be useful: Consider different processes on different computing nodes, communicating via the Message Passing Interface, which should not send the actual data (which is big), but only the indices of the data entries they are processing. At the same time, the order in which data is processed should be pseudo-randomized so that the progress speed does not "bounce" much (which it does when processing along the ordered indices). This is essential for reliable predictions of how long the overall computation will take. So every node must know which transformed index refers to which actual index, i.e. every node must compute the same result for distributeIndices(from, to). ]

The fastest correctly working solution wins the accepted answer.

  • No C/C++ code, no acceptable answer.
    (Except when it is a proof that the problem cannot be solved efficiently.)

I will test solutions with GCC 11.3 -O3 on my old i7-3610QM laptop with 8 hardware threads on 100 million indices (i.e. distributeIndices(c, c + 99999999)), and may change accepted answers when future answers provide a better performing solution.

Testing code (run up to 10 times, pick fastest execution):

int main(int argc, char* argv[]) {
    uint64_t c = argc < 3 ? 42 : atoll(argv[1]);
    uint64_t s = argc < 3 ? 99999 : atoll(argv[2]); // 99999999 for performance testing
    for (unsigned i = 0; i < 10; i++) {
        chrono::time_point<chrono::system_clock> startTime = chrono::system_clock::now();
        auto indices = distributeIndices(c, c + s);
        chrono::microseconds dur = chrono::duration_cast<chrono::microseconds>(chrono::system_clock::now() - startTime);
        cout << durationStringMs(dur) << endl;
        // [... some checks ...]
#if 0 // bijectivity check
        set<uint64_t> m = set<uint64_t>(indices.begin(), indices.end());
        cout << "min: " << *m.begin() << " , max: " << *prev(m.end()) << ", #elements: " << m.size() << endl;
#endif
        cout << "required average: " << round((2.0L * c + s) / 2, 2) << endl;
        long double avg = accumulate(indices.begin(), indices.end(), __uint128_t(0)) / static_cast<long double>(indices.size());
        string sa = round(avg, 2);
        cout << "actual average:   " << sa << endl;
        auto printTrendlineHelpers = [](uint64_t minX, string avgX, uint64_t maxX, uint64_t minY, string avgY, uint64_t maxY) {
            cout << "Trendline helpers:" << endl;
            cout << "[max] " << minX << " " << maxY << " " << avgX << " " << maxY << " " << maxX << " " << maxY << endl;
            cout << "[avg] " << minX << " " << avgY << " " << avgX << " " << avgY << " " << maxX << " " << avgY << endl;
            cout << "[min] " << minX << " " << minY << " " << avgX << " " << minY << " " << maxX << " " << minY << endl;
        };
        // Print some plottable data, for e.g. https://www.rapidtables.com/tools/scatter-plot.html
        unsigned plotAmount = 2000;
        auto printPlotData = [&](uint64_t start, uint64_t end) {
            long double rng = static_cast<long double>(end - start);
            long double avg = accumulate(indices.begin() + start, indices.begin() + end, __uint128_t(0)) / rng;
            cout << "\ndistributeIndices(" << c << ", " << c << "+" << s << ")[" << start << ", ..., " << end - 1 << "]: (average " << round(avg, 2) << ")" << endl;
            stringstream ss;
            for (uint64_t i = start; i < end; i++)
                ss << i << " " << indices[i] << (i + 1 == end ? "" : " ");
            cout << ss.str() << endl;
            printTrendlineHelpers(start, round(start + rng / 2, 2), end - 1, c, sa, c + s);
        };
        printPlotData(0, plotAmount); // front
        printPlotData(indices.size() / 2 - plotAmount / 2, indices.size() / 2 + plotAmount / 2); // middle
        printPlotData(indices.size() - plotAmount, indices.size()); // back
#if 1 // Print average course
        if (s >= 1000000)
            plotAmount *= 10;
        stringstream ss;
        for (uint64_t start = 0; start < indices.size(); start += plotAmount) {
            uint64_t end = min(start + plotAmount, indices.size());
            uint64_t i = start + (end - start) / 2;
            long double avg = accumulate(indices.begin() + start, indices.begin() + end, __uint128_t(0)) / static_cast<long double>(end - start);
            ss << i << " " << round(avg, 2) << (end == indices.size() ? "" : " ");
        }
        cout << "\nAverage course of distributeIndices(" << c << ", " << c << "+" << s << ") with slices of size " << plotAmount << ":\n" << ss.str() << endl;
        printTrendlineHelpers(c, sa, c + s, c, sa, c + s);
        break;
#endif
    }
    return 0;
}
  • Storage of the result (e.g. via static variables) is obviously not allowed.
  • uint64_t from and uint64_t to cannot be considered constexpr.

My two (unsuitable) examples would be 14482.83 ms (14 s 482.83 ms) and 186812.68 ms (3 min 6 s 812.68 ms).
The second approach seems terribly slow, but upon closer inspection it is the only one that on my system actually distributes the values:

  • unordered_set<uint64_t> variant:
    e.g. 100000041, 100000040, 100000039, 100000038, 100000037, ... // bad
  • tbb::concurrent_unordered_set<uint64_t> variant:
    e.g. 67108864, 33554432, 16777216, 83886080, 50331648, ... // well-distributed, but not looking random
    • distributeIndices(42, 42+99999999)[0, ..., 999] with polynomial trendline:
      https://mcmap.net/q/662367/-fastest-way-to-bring-a-range-from-to-of-64-bit-integers-into-pseudo-random-order-with-same-results-across-all-platforms example
      The above distribution looks ordered, not random.

An exemplary distribution suggesting randomness can be obtained from olegarch's answer, as of Apr 30, 2023.

// LCG params from: https://nuclear.llnl.gov/CNP/rng/rngman/node4.html
std::vector<uint64_t> distributeIndices(uint64_t lo, uint64_t hi) {
    uint64_t size = hi - lo + 1;
    std::vector<uint64_t> vec(size);
    for(uint64_t i = 0; i < size; i++)
        vec[i] = i + lo;
    uint64_t rnd = size ^ 0xBabeCafeFeedDad;
    for(uint64_t i = 0; i < size; i++) {
        rnd = rnd * 2862933555777941757ULL + 3037000493;
        uint64_t j = rnd % size;
        uint64_t tmp = vec[i]; vec[i] = vec[j]; vec[j] = tmp;
    }
    return vec; // NRVO applies; std::move would inhibit copy elision
}

Note that the solution is still incorrect since it doesn't provide uniform distributions for all ranges, as shown further below. It also does not utilize parallel computing, but it performs well: Computing 100 million indices took 3235.18 ms on my i7-3610QM.

  • front ; distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76077930 front
  • middle ; distributeIndices(42, 42+99999999)[49999000, ..., 50000999] with polynomial trendline:
    stackoverflow.com/a/76077930 middle
  • back ; distributeIndices(42, 42+99999999)[99998000, ..., 99999999] with polynomial trendline:
    stackoverflow.com/a/76077930 back
    The above distribution looks random, even though the average seems a little low and bouncy in the beginning. Its global trend looks as follows.
  • average course of distributeIndices(42, 42+99999999) with polynomial trendline:
    stackoverflow.com/a/76077930 averages
    The polynomial trendline deviates up to 5% from the global average, so the distribution is not uniform.
    And for some ranges, it gets worse:
  • average course of distributeIndices(0, 67108863) with polynomial trendline:
    stackoverflow.com/a/76077930 another averages
  • front ; distributeIndices(0, 67108863)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76077930 another front
    This is clearly not an acceptable distribution.

An exemplary distribution with a flawless trendline can be obtained from Severin Pappadeux's answer, as of Apr 30, 2023. Following the suggestions, I added some parallelization.

uint64_t m = 0xd1342543de82ef95ULL; // taken from https://arxiv.org/pdf/2001.05304.pdf
uint64_t c = 0x1ULL;
inline auto lcg(uint64_t xi) -> uint64_t { // as LCG as it gets
    return m*xi + c;
}
inline auto cmp_lcg(uint64_t a, uint64_t b) -> bool {
    return lcg(a) < lcg(b);
}
auto distributeIndices(uint64_t from, uint64_t to) -> std::vector<uint64_t> {
    uint64_t size = to - from + 1;
    std::vector<uint64_t> z(size);
    tbb::parallel_for(uint64_t(0), size, [&](uint64_t i) {
        z[i] = from + i;
    }); // instead of std::iota(z.begin(), z.end(), from);
    tbb::parallel_sort(z.begin(), z.end(), cmp_lcg); // instead of std::sort(z.begin(), z.end(), cmp_lcg);
    return z;
}

To give an idea of performance boost via multithreading, computing 100 million indices on my i7-3610QM took 15925.91 ms sequentially and 3666.21 ms with parallelization (on 8 hardware threads).
On a computing cluster with Intel Xeon Platinum 8160 processors, I measured the (#cpu,duration[ms]) results (1,19174.65), (2,9862.29), (4,5580.47), (8,3402.05), (12,2119.28), (24,1606.78), and (48,1330.20).

It should also be noted that the code is better optimized and runs much faster when cmp_lcg is turned into a lambda function, e.g. auto cmp_lcg = [](uint64_t a, uint64_t b) -> bool { return lcg(a) < lcg(b); };. This way, it performed best at 2608.15 ms on my i7-3610QM. Slightly better performance still can be reached by declaring the global variables m and c as constexpr, or making them locals or literals, which led to a duration of 2542.14 ms.
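That fastest variant can be sketched as follows. This is my own condensation of the tweaks just described (constexpr local constants, lambda comparator); std::sort stands in for tbb::parallel_sort so the snippet carries no TBB dependency:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Sort-by-LCG variant with the tweaks described above: constexpr constants
// and a lambda comparator. Swap std::sort for tbb::parallel_sort under TBB.
std::vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    constexpr uint64_t m = 0xd1342543de82ef95ULL; // full-period 64-bit LCG multiplier
    constexpr uint64_t c = 0x1ULL;
    std::vector<uint64_t> z(to - from + 1);
    std::iota(z.begin(), z.end(), from);
    std::sort(z.begin(), z.end(), [](uint64_t a, uint64_t b) {
        return m * a + c < m * b + c; // compare LCG-transformed values
    });
    return z;
}
```

Since m is odd, x -> m*x + c is a bijection on 64-bit values, so the comparator induces a strict total order and the result is always a permutation of {from, ..., to}.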

  • average course of distributeIndices(42, 42+99999999) with polynomial trendline:
    stackoverflow.com/a/76082801 averages
    stackoverflow.com/a/76082801 averages zoomed
    But when looking at the actual distribution, it becomes apparent that it is not random, but ordered:
  • front ; distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76082801 front
    The solution is incorrect regarding the requirement that distributions should appear random, but due to its mostly nice distributions, it is particularly useful for the previously mentioned MPI use case. So if no fully correct solution appears, it could make an accepted answer as well, as long as no plausible ranges with non-uniform distributions are found. Plausible here means that ranges for which the algorithm would run for at least several days can be ignored.
    The factor 0xd1342543de82ef95 shows some weaknesses in the spectral test, and I haven't yet found a reason not to use 0x9e3779b97f4a7c15 instead.

After all this, it should be clear that the task is to combine

  1. perceived randomness of a
  2. uniform bijective distribution of
  3. plausible arbitrary intervals, with
  4. high performance.

I am very curious if there even exist any correct and well-performing solutions to the problem!
A negative answer to this question, with a corresponding proof, would of course also be acceptable.

Even better than distributeIndices(uint64_t, uint64_t) -> vector<uint64_t> would be an approach that does not construct a vector at all, but merely iterates the indices in pseudo-random order. That would require each pseudo-random index to be efficiently computable from its actual index (without iterating over all the elements before it). I would be surprised if that is possible, but I'd gladly be surprised. Such approaches are always considered better than the vector-constructing ones, and are compared amongst each other by the duration of iterating 100 million indices.

Paramnesia answered 21/4, 2023 at 21:57 Comment(2)
Comments have been moved to chat; please do not continue the discussion here. Before posting a comment below this one, please review the purposes of comments. Comments that do not request clarification or suggest improvements usually belong as an answer, on Meta Stack Overflow, or in Stack Overflow Chat. Comments continuing discussion may be removed. – Bustup
Related meta question (was automatically deleted (now only visible to users with more than 10,000 reputation points and to the OP)): Will there be done something about questions being closed for false reasons? – Miscall

Overview

The following solution constructs a bijective function F that maps a range of integers onto itself. This function can be used to compute a pseudo-random index directly from a source index such that the resulting pseudo-random indices are a permutation of the source indices.

There are three ideas (all borrowed from cryptography) that taken together allow for the construction of such a function: 1) a pseudo-random function family (PRF), 2) a Feistel network, and 3) format-preserving encryption (FPE). While these ideas draw on well-studied cryptography concepts, I believe the end product is probably unique and should definitely be considered insecure.

The basic strategy is to encrypt the source index to produce the target index. The secret sauce is to design the encryption to be bijective and use a range of integers as the domain. I have dubbed this method feisty for the use of a Feistel network.

Pseudo Random Function Family

The first step in the construction is creating a PRF that returns a pseudo-random value given a 64-bit input. We can create this family using a single function that also takes a subkey parameter used to select the particular function. The canonical PRF example uses AES to produce a 128-bit pseudo-random value. We will use the following function, which is more efficient to evaluate (although much less secure) and produces a 64-bit pseudo-random value. The s0 parameter is the source index and the s1 parameter is the subkey.

uint64_t pseudo_random_function(uint64_t s0, uint64_t s1) {
    auto a = s0 + s1;
    a ^= a >> 12;
    a ^= a << 25;
    a ^= a >> 27;
    return a * 0x2545f4914f6cdd1dull;
}

This function can be used directly to order the source indices, producing a pseudo-random permutation as in Severin Pappadeux's answer, which is equivalent to constructing an FPE using a prefix cipher. The main difference is that this PRF produces more random-looking results than the linear congruential generator, as shown in the following plot.

Iterated PRF
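As a sketch of that direct use, here is a small ordering-by-PRF helper. The name prf_order and the subkey value 0 are my choices, not part of the answer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// The PRF from above: xorshift steps followed by a multiplicative mix.
// Each step is bijective, so the PRF is injective in s0 for a fixed subkey s1.
uint64_t pseudo_random_function(uint64_t s0, uint64_t s1) {
    uint64_t a = s0 + s1;
    a ^= a >> 12;
    a ^= a << 25;
    a ^= a >> 27;
    return a * 0x2545f4914f6cdd1dull;
}

// Order [from, to] by PRF value; the subkey 0 is an arbitrary choice here.
// Injectivity of the key makes the comparator a strict total order, so the
// result is a deterministic, platform-independent permutation.
std::vector<uint64_t> prf_order(uint64_t from, uint64_t to) {
    std::vector<uint64_t> v(to - from + 1);
    std::iota(v.begin(), v.end(), from);
    std::sort(v.begin(), v.end(), [](uint64_t a, uint64_t b) {
        return pseudo_random_function(a, 0) < pseudo_random_function(b, 0);
    });
    return v;
}
```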

Feistel Network

Instead of using the PRF directly, we will apply a Feistel network that leverages the PRF as its round function. The two key advantages of the Feistel network are that 1) the operation is guaranteed to be invertible (i.e. bijective) even if the round function is not, and 2) the number of output bits can be selected to be at most one or two more than the number of input bits, making the encoding range of the network at most four times larger than the input domain. The suggested minimum number of rounds for security applications is three. The following class implements a balanced Feistel network.

template<class PRF>
class FeistelNetwork {
public:
    FeistelNetwork(int number_of_bits, int number_rounds, PRF&& prf)
        : shift_((1 + number_of_bits) / 2)
        , mask_((uint64_t{1} << shift_) - 1)
        , nrounds_(number_rounds)
        , prf_(std::forward<PRF>(prf)) {
    }

    auto encode(uint64_t msg) const {
        auto [left, right] = split(msg);
        for (auto i = 0; i < nrounds_; ++i)
            round(left, right, Rounds[i]);
        return combine(left, right);
    }

    auto decode(uint64_t msg) const {
        auto [left, right] = split(msg);
        for (int i = nrounds_ - 1; i >= 0; --i)
            round(right, left, Rounds[i]);
        return combine(left, right);
    }

private:
    std::tuple<uint64_t, uint64_t> split(uint64_t msg) const {
        auto right = msg bitand mask_;
        auto left = (msg >> shift_) bitand mask_;
        return std::make_tuple(left, right);
    }

    uint64_t combine(uint64_t left, uint64_t right) const {
        return (left << shift_) bitor right;
    }

    void round(uint64_t& left, uint64_t& right, uint64_t constant) const {
        auto prf_value = prf_(right, constant) bitand mask_;
        auto r = left ^ prf_value;
        left = right;
        right = r;
    }

    static constexpr uint64_t Rounds[] = {
        0x88ef7267b3f978daull,
        0x5457c7476ab3e57full,
        0x89529ec3c1eec593ull,
        0x3fac1e6e30cad1b6ull,
        0x56c644080098fc55ull,
        0x70f2b329323dbf62ull,
        0x08ee98c0d05e3dadull,
        0x3eb3d6236f23e7b7ull,
        0x47d2e1bf72264fa0ull,
        0x1fb274465e56ba20ull,
        0x077de40941c93774ull,
        0x857961a8a772650dull
    };

    int shift_;
    uint64_t mask_;
    int nrounds_;
    PRF prf_;
};

Cycle-walk and Feisty Permutation

If the source index range happens to be an exact power of two, then we can simply call encode on the Feistel network to map a source index to a pseudo-random target index. In general, however, the Feistel network may return an encoding that is outside the source index domain. The solution is to simply call encode on the out-of-range index again until we get an index that is in the source index domain. This iteration will terminate because the Feistel network encryption is bijective and the domain is finite. For the worst-case source index range (i.e. one more than an exact power of two), there will be an average of almost four calls to encode for a balanced network, or two for an unbalanced network. The following class implements the basic logic along with mapping the source index domain from [min, max] to [0, max - min].
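That class does not appear to have survived in this copy. The following is a sketch of the described cycle-walking wrapper under my own assumptions: the class name PseudoRandomPermutation and its interface are guesses from the driver below, a compact balanced Feistel network is inlined rather than reused, and the domain is taken as the half-open range [min, max):

```cpp
#include <cassert>
#include <cstdint>

// The PRF from earlier in the answer, repeated here for self-containment.
uint64_t pseudo_random_function(uint64_t s0, uint64_t s1) {
    uint64_t a = s0 + s1;
    a ^= a >> 12;
    a ^= a << 25;
    a ^= a >> 27;
    return a * 0x2545f4914f6cdd1dull;
}

// Sketch of the cycle-walking wrapper (name and interface are assumptions).
// Domain is [min, max), assumed non-empty; rounds must be <= 5 in this sketch.
class PseudoRandomPermutation {
public:
    using PRF = uint64_t (*)(uint64_t, uint64_t);

    PseudoRandomPermutation(uint64_t min, uint64_t max, int rounds, PRF prf)
        : min_(min), max_(max), rounds_(rounds), prf_(prf) {
        // Half-width of the Feistel block: enough bits to cover max - min - 1.
        uint64_t size = max - min;
        int bits = 1;
        while (size >> bits) ++bits;
        shift_ = (bits + 1) / 2;
        mask_ = (uint64_t{1} << shift_) - 1;
    }

    uint64_t min() const { return min_; }
    uint64_t max() const { return max_; }

    // Cycle-walk: re-encrypt until the value lands back inside the domain.
    // Terminates because the Feistel network permutes a finite superset and
    // the orbit of x eventually returns to a value below max - min.
    uint64_t encode(uint64_t index) const {
        uint64_t x = index - min_;
        do {
            x = feistel(x);
        } while (x >= max_ - min_);
        return x + min_;
    }

private:
    uint64_t feistel(uint64_t msg) const {
        uint64_t right = msg & mask_;
        uint64_t left = (msg >> shift_) & mask_;
        for (int i = 0; i < rounds_; ++i) {
            uint64_t r = left ^ (prf_(right, Rounds[i]) & mask_);
            left = right;
            right = r;
        }
        return (left << shift_) | right;
    }

    static constexpr uint64_t Rounds[5] = {
        0x88ef7267b3f978daull, 0x5457c7476ab3e57full, 0x89529ec3c1eec593ull,
        0x3fac1e6e30cad1b6ull, 0x56c644080098fc55ull
    };

    uint64_t min_, max_, mask_;
    int rounds_, shift_;
    PRF prf_;
};
```

Because the Feistel step is a bijection on its power-of-two superset, the cycle walk is a bijection on the actual domain, which is exactly the property distributeIndices needs.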

Results

All of the code and images can be found on GitHub in the 76076957 directory. I used the following driver for testing and generating the performance metrics, all of which use three rounds in the Feistel network. I wrote the code for clarity and have not yet done any performance work, although I think the inner loops are already pretty efficient.

#include "core/util/tool.h"
#include "core/chrono/stopwatch.h"
#include "core/string/lexical_cast_stl.h"

template<class Work>
void measure(std::ostream& os, std::string_view desc, Work&& work) {
    chron::StopWatch timer;
    timer.mark();
    if (work()) {
        os << fmt::format("{:>12s}: work failed", desc) << endl;
        return;
    }
    auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
    os << fmt::format("{:>12s}: {:5d} ms", desc, millis) << endl;
}

int tool_main(int argc, const char *argv[]) {
    ArgParse opts
        (
         argValue<'m'>("range", std::make_pair(0, 16), "Permutation range min:max"),
         argValue<'r'>("rounds", 3, "Number of rounds"),
         argFlag<'p'>("performance", "Measure performance"),
         argFlag<'s'>("sort", "Sort index based on PRF")
         );
    opts.parse(argc, argv);
    auto [min, max] = opts.get<'m'>();
    auto rounds = opts.get<'r'>();
    auto measure_performance = opts.get<'p'>();
    auto sort_index = opts.get<'s'>();

    if (measure_performance) {
        PseudoRandomPermutation perm(min, max, rounds, &pseudo_random_function);
        measure(cout, "Permutation", [&]() {
            for (auto i = perm.min(); i < perm.max(); ++i) {
                auto code = perm.encode(i);
                if (code < perm.min() or code > perm.max())
                    return true;
            }
            return false;
        });
    } else if (sort_index) {
        std::vector<uint64_t> codes;
        for (auto i = min; i < max; ++i)
            codes.push_back(i);
        std::sort(codes.begin(), codes.end(), [](uint64_t a, uint64_t b) {
            return iterate_prf(a, 3) < iterate_prf(b, 3);
        });
        for (auto elem : codes)
            cout << elem << endl;
    } else {
        std::set<uint64_t> codes;
        PseudoRandomPermutation perm(min, max, rounds, &pseudo_random_function);
        for (auto i = min; i < max; ++i) {
            auto code = perm.encode(i);
            assert(code >= min and code <= max);
            codes.insert(code);
            cout << i << " " << code << endl;
        }
        assert(codes.size() == max - min);
    }

    return 0;
}

I have not done any statistical tests, but I have simply eyeballed the plots, and based on those eyeball tests I believe this answer satisfies the criteria:

  1. It looks random (cannot differentiate from std::shuffle).
  2. It looks uniformly distributed (straight line cumulative sum).
  3. It is efficient (tens of nanoseconds / index).
  4. Computable directly from a single actual index.

Basic Performance Measurements

39ns / index on Mac M1 Pro (arm64, MacOSX)
52ns / index on Intel Xeon ES-2698 @ 2.2Ghz (x86, Ubuntu 20.04)

Randomness

The following two plots compare using std::shuffle versus feisty for creating the pseudo-random permutation of 20k indices. The third plot shows the cumulative sum of the pseudo-random indices, which should be a straight line for a uniform distribution.

Shuffle, 20k Feisty, 3-rounds, 20k Feisty, cumulative sum

Round Behavior

Just for curiosity's sake, here are plots using from 1 to 5 rounds of the Feistel network. As suggested by theory, at least three rounds are needed to achieve good results.

Feisty, 1-round, 20k Feisty, 2-rounds, 20k Feisty, 3-rounds, 20k Feisty, 4-rounds, 20k Feisty, 5-rounds, 20k

Finder answered 3/5, 2023 at 15:17 Comment(3)
This is an impressive solution. You even solved the bonus task of immediate permutation of any element! I measured 4329.25 ms (4 s 329.25 ms) for iterating 100 million indices. Here is a compiling code snippet that clarifies how I measured the performance of this solution. – Paramnesia
That is 43 ns / element, which is in line with my measurements of 39 ns (M1 Max) and 52 ns (Xeon). I just tried a tweak which unrolls the Feistel encode loop assuming three rounds, and it dropped the times to 36 ns and 46 ns respectively. A meaningful improvement on the x86 hardware. If you need more performance, I can probably squeeze out a little more juice. – Finder
I hope my compiler unrolled that encode loop, given that I used a literal 3 as the argument for rounds. I would be even more surprised if other people gave answers with immediate permutations that are faster than this one, especially seeing how poorly this question is received. I am already very glad I have some answer that deserves my bounty. But if you think your time could easily be beaten by a skilled cryptographer, you might want to consider squeezing. :) – Paramnesia

A simple idea: stuff an array with incremental indices, and thereafter just shuffle it using our own, system-independent random generator.

You did not request a cryptographically secure permutation, so I used just a simple LCG in this example. If you need to keep your shuffle cryptographically secure, I suggest using RC4. It has a good balance between security and performance.

#include <vector>
#include <algorithm>
#include <stdint.h>
#include <stdio.h>

// LCG params from: https://nuclear.llnl.gov/CNP/rng/rngman/node4.html
std::vector<uint64_t> distributeIndices(uint64_t lo, uint64_t hi) {
    uint64_t size = hi - lo + 1;
    uint64_t mask = 1;
    while(mask < size)
        mask <<= 1;
    mask--;
    std::vector<uint64_t> vec(size);
    for(uint64_t i = 0; i < size; i++)
        vec[i] = i + lo; 
    uint64_t rnd = (hi + size) ^ 0xBabeCafeFeedDad;
    for(uint64_t i = size - 1; i != 0; i--) {
        uint64_t j;
        do {
            rnd = rnd * 2862933555777941757ULL + 3037000493;
            j = rnd & mask;
        } while(j >= size);
        uint64_t tmp = vec[i]; vec[i] = vec[j]; vec[j] = tmp;
    }   
    return vec; // NRVO applies; std::move would inhibit copy elision
}

int main(int argc, char **argv) {
    uint64_t lo = atoll(argv[1]);
    uint64_t hi = atoll(argv[2]);
    std::vector<uint64_t> vec = distributeIndices(lo, hi);
    for (uint64_t x : vec)
        printf("%lu\n", x);
    return 0;
}
Chondro answered 22/4, 2023 at 3:49 Comment(9)
rnd % size is far from uniform. Look at how std::uniform_int_distribution works. – Stefanistefania
"rnd % size is far from uniform." This is true, but it does not matter in this case: the result is just that some cells get swapped twice. One swap is enough for good randomization, so the second one does not matter. – Chondro
Also, if you really worry about rnd % size, I can modify the code and generate uniform rnd output. But it will make the code slower, maybe ~2x in the worst case. I think it is better to keep it as is. – Chondro
While for some bounds the results really look random and well-distributed, unfortunately your approach is unreliable w.r.t. uniformity. I elaborated on this in my now updated question. – Paramnesia
Of course, a simple LCG is not good enough for your deep tests. I stated as much in the original post. To get a high-quality uniform distribution, I already suggested a cryptographic RND generator, for example RC4. – Chondro
No, an LCG (or, in correct terms, an affine cipher) can be used to generate uniform distributions, for which I mentioned the other answer as an example (with pictures) in the question. The non-uniformity stems from how you obtain your distribution from the ciphered values. Using a cryptographically secure permutation does not change anything about that. – Paramnesia
Hm.. Maybe you are right.. Can you test another approach using rnd % i on your data suite? I updated the code above. – Chondro
@Chondro There is no logical reason why using rnd % i instead of rnd % size would fix anything. There is a plotter and testing code mentioned in my question if you'd like to experiment, but to actually solve the problem, one probably has to think a little harder and use mathematics. The question is harder than what you'd typically find on Stack Overflow, which I suppose is also why many people downvoted in frustration. – Paramnesia
OK, I removed the %, and used classic iterative probing. It decreases RND performance ~2x in the worst case (avg 2 attempts per index), but the distribution will be uniform. However, I still think the problem is not the %, but rather the LCG. For example, an LCG always generates the sequence odd-even-odd-even..., which is not random. – Chondro

The simplest approach is to just sort the indices using an LCG that maps the whole range [0, 2^64) bijectively onto itself, which holds if the LCG parameters satisfy the conditions of the Hull–Dobell theorem. Good spectral parameters are taken from "Computationally easy, spectrally good multipliers for congruential pseudorandom number generators".

You could easily adapt it to TBB vectors and parallel sorts.

Along the lines

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

#define func auto

uint64_t m = 0xd1342543de82ef95ULL; // taken from https://arxiv.org/pdf/2001.05304.pdf
uint64_t c = 0x1ULL;

inline func lcg(uint64_t xi) -> uint64_t { // as LCG as it gets
    return m*xi + c;
}

inline func cmp_lcg(uint64_t a, uint64_t b) -> bool {
    return lcg(a) < lcg(b);
}

func distributeIndices(uint64_t from, uint64_t to) -> std::vector<uint64_t> {
    std::vector<uint64_t> z(to - from + 1);

    std::iota(z.begin(), z.end(), from);

    std::sort(z.begin(), z.end(), cmp_lcg);

    return z;
}

static const char NL = '\n';

func main() -> int {

    auto q = distributeIndices(7, 23);

    for(auto v: q)
        std::cout << v << " " << lcg(v) << NL;
    std::cout << NL;

    return 0;
}
Arnoldarnoldo answered 23/4, 2023 at 2:9 Comment(10)
Nice concept for parallelization. I used tbb::parallel_sort instead of std::sort for time measurement. Unfortunately, with 3666.21 ms (3 s 666.21 ms), it is still behind, but close. Sequentially, it took 15925.91 ms. Results are also looking good.Paramnesia
@Paramnesia Curious if this will work better than TBBArnoldarnoldo
The parallelization of std:: methods is done by adapting the parallelization schema in use (which usually is Intel's oneTBB). OpenMP tends to be a tiny bit faster than TBB, but as a non-standard preprocessor hack without shared memory data structures, it is much less flexible than TBB, a C++ standard compliant library compatible with std::thread (which OpenMP is not). So I don't use OpenMP and would recommend TBB to anyone who wants to implement some sophisticated parallel behavior. Incidentally, even the note behind your link recommends against using non-standard parallel mode extensions.Paramnesia
Well, parallel algorithms are now part of C++ standard, but quality of implementation still varies. cppstories.com/2018/11/parallel-alg-perfArnoldarnoldo
So if TBB is about the same speed wise, stay with TBBArnoldarnoldo
The C++17 algorithms are exactly what I was talking about. The std::execution::par actually has no effect when not having a parallelization library around, which they can inject. So they are not faster than TBB when using TBB. But I wouldn't recommend using them, since they mess up exception handling in threads. Possibly that is a GCC and LLVM specific defect, I don't know what the standard says (e.g. an exception thrown in a std::for_each(...) running in parallel is not caught in the main thread). But Intel really did a much better job at implementing shared memory parallelization.Paramnesia
Also might be worthwhile to check how this approach and parallel sort will behave with growing number of CPUs/cores. Together with parallel std::iotaArnoldarnoldo
Your code performs much faster with tbb::parallel_sort when cmp_lcg is turned into a lambda. It should be noted, though, that upon visual inspection the results look too ordered to suggest randomness. I mentioned these points and also provided some measured durations based on different core amounts in my now updated question. Instead of std::iota I used tbb::parallel_for, but it did not give noticeably better timings on my laptop. PS: Sorry for the late reply, I was suspended for a week for being impolite about close votes.Paramnesia
You could save some time by constructing an empty vector, then calling reserve(), then iota. As it stands, the constructor runs a loop initializing everything with the default value; you need just the allocation, since iota will do the initialization anyway. The lambda being faster is probably a failure of the optimizer.Arnoldarnoldo
Concerning your picture, you got LCG lattice structure, could check it in WikiArnoldarnoldo
P
0

When something appears ordered, i.e. non-random, it means that it follows some recognizable patterns. Elements of a pattern repeat in a predictable manner. Therefore, by the nature of what a pattern is, it should be clear that the distributions of any cryptographic hash function (CHF) should not give away any patterns, because that would make it less secure.
For the same reason, such distributions must also be uniform, since otherwise there would be subsets of hashes that would be easier to attack.
Consequently, it seems like a safe bet to utilize some cryptographic hash function in order to meet the requirements of uniformity and random appearance. But cryptographic hash functions tend to be slow, and more work has to be done since their distributions are not bijective.

So while the following approach does not have great performance, it is meant to show that there are at least okay-ish solutions that fulfill the specification.

One of the fastest widespread CHFs is MD5, which is used for the sake of this demonstration. It is well-defined, and therefore conforms with the "same results regardless of the used C++ implementation" requirement. It can be nicely adapted to the problem because it provides 128-bit hashes, which can be treated as single numbers in C++ (via GNU __uint128_t, or by using a library such as @ckormanyos/wide-integer).

I tested with two implementations: OpenSSL [MinGW-w64 package] and @animetosho/md5-optimisation.
OpenSSL turned out to be around 30% slower. It probably was not designed to operate on such small inputs.

Newly created md5.h:

#ifndef MD5_WRAPPER_H // note: identifiers like _MD5_H (leading underscore + capital) are reserved
#define MD5_WRAPPER_H

#include <cstdint>
#include <ostream>

typedef __uint128_t uint128_t;
uint128_t md5(std::uint64_t num);
std::ostream& operator<<(std::ostream& out, uint128_t x); // in case you want to print uint128_t values

#endif // MD5_WRAPPER_H

Inserted into md5-optimisation's md5.cpp:

#include "md5.h"
uint128_t md5(std::uint64_t num) {
    MD5_STATE<std::uint32_t> hash;
    md5<std::uint32_t, md5_block_noleag>(&hash, &num, sizeof(std::uint64_t));
    return (uint128_t(hash.A) << 96) | (uint128_t(hash.B) << 64) | (uint128_t(hash.C) << 32) | hash.D;
}
std::ostream& operator<<(std::ostream& out, uint128_t x) {
    if (x >= 10)
        out << x / 10;
    return out << static_cast<unsigned>(x % 10);
}
// + main function removed

Note that the library contains numerous strategies. md5_block_noleag turned out fastest on my system for this use case.

Inserted into main.cpp (testing code from the question):

#include "md5.h"
#include <openssl/md5.h>
using namespace std;
#define S(X,n) (uint128_t(X) << 8 * n)
#define N(X) (S(X[ 3], 15) | S(X[ 2], 14) | S(X[ 1], 13) | S(X[ 0], 12) \
            | S(X[ 7], 11) | S(X[ 6], 10) | S(X[ 5],  9) | S(X[ 4],  8) \
            | S(X[11],  7) | S(X[10],  6) | S(X[ 9],  5) | S(X[ 8],  4) \
            | S(X[15],  3) | S(X[14],  2) | S(X[13],  1) | S(X[12],  0))
// e.g. MD5-hash(42) = "e8bdc34458096e6d18755f4c86df8b95" ; NOTE: MD5-hash(42) != MD5-hash("42")
// little-endian: e8bdc344 58096e6d 18755f4c 86df8b95 ; 309366199612997410650608664883390024597
// big-endian:    44c3bde8 6d6e0958 4c5f7518 958bdf86 ;  91403853391004256260858066647535837062 <- numeric value
uint128_t openssl_md5(uint64_t num) {
    unsigned char hash[MD5_DIGEST_LENGTH]; // length 16
    MD5_CTX md5;
    MD5_Init(&md5);
    MD5_Update(&md5, &num, sizeof(uint64_t));
    MD5_Final(hash, &md5);
    return N(hash);
}

Now, to convert a 64-bit integer into its MD5-hashed 128-bit integer, one could use md5 or openssl_md5.

The idea follows the approach to concurrently sort integers w.r.t. their hashed values.

auto distributeIndices(uint64_t from, uint64_t to) -> vector<uint64_t> {
    auto cmp_md5 = [](uint64_t a, uint64_t b) { return md5(a) < md5(b); }; // note: md5() runs on every comparison
    uint64_t size = to - from + 1;
    vector<uint64_t> v(size);
    tbb::parallel_for(uint64_t(0), size, [&](uint64_t i) {
        v[i] = from + i;
    });
    tbb::parallel_sort(v.begin(), v.end(), cmp_md5);
    return v;
}

Unless a faster way to utilize hash functions while keeping uniformity and establishing bijectivity is found, the task boils down to finding faster hash functions whose distributions do not look ordered (a property all cryptographic hash functions should have).
So it seems that this question implicitly focuses on cryptography a lot more than I anticipated.

Computing 100 million indices took my i7-3610QM 153252.76 ms (2 min 33 s 252.76 ms).
For visual inspection, here is an exemplary distribution obtained this way, with plots:

  • average course of distributeIndices(42, 42+99999999) with polynomial trendline:
    stackoverflow.com/a/76077930 averages
  • front ; distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76077930 front
  • middle ; distributeIndices(42, 42+99999999)[49999000, ..., 50000999] with polynomial trendline:
    stackoverflow.com/a/76077930 middle
  • back ; distributeIndices(42, 42+99999999)[99998000, ..., 99999999] with polynomial trendline:
    stackoverflow.com/a/76077930 back
Paramnesia answered 3/5, 2023 at 14:29 Comment(0)
