Distributions and internal state
Asked Answered
I

1

22

On Stackoverflow there are many questions about generating uniformly distributed integers from a-priory unknown ranges. E.g.

The typical solution is something like:

inline std::mt19937 &engine()
{
  thread_local std::mt19937 eng;
  return eng;
}

int get_int_from_range(int from, int to)
{
  std::uniform_int_distribution<int> dist(from, to);
  return dist(engine());
}

Given that a distribution should be a lightweight object and there aren't performance concerns recreating it multiple times, it seems that even simple distribution may very well and usually will have some internal state.

So I was wondering if interfering with how the distribution works by constantly resetting it (i.e. recreating the distribution at every call of get_int_from_range) I get properly distributed results.

There's a long discussion between Pete Becker and Steve Jessop but without a final word. In another question (Should I keep the random distribution object instance or can I always recreate it?) the "problem" of the internal state doesn't seem very important.

Does the C++ standard make any guarantee regarding this topic?

Is the following implementation (from N4316 - std::rand replacement) somewhat more reliable?

int get_int_from_range(int from, int to)
{
  using distribution_type = std::uniform_int_distribution<int>;
  using param_type = typename distribution_type::param_type;

  thread_local std::uniform_int_distribution<int> dist;
  return dist(engine(), param_type(from, to));    
}

EDIT

This reuses a possible internal state of a distribution but it's complex and I'm not sure it does worth the trouble:

int get_int_from_range(int from, int to)
{
  using range_t = std::pair<int, int>;
  using map_t = std::map<range_t, std::uniform_int_distribution<int>>;

  thread_local map_t range_map;

  auto i = range_map.find(range_t(from, to));
  if (i == std::end(range_map))
    i = range_map.emplace(
          std::make_pair(from, to),
          std::uniform_int_distribution<int>{from, to}).first;

  return i->second(engine());
}

(from https://mcmap.net/q/658349/-thread-local-static-uniform_int_distribution-will-it-be-recreated-for-different-sets-of-min-max)

Ingenuous answered 7/5, 2015 at 14:0 Comment(6)
Where do you get "even a simple distribution [...] usually will have some internal state" from that link?Expulsive
For practical purposes, why take the risk? Using only one generator (instead of a new one every function call) is already clear, so what´s the problem with using only one distribution too? Maybe the other solution will work flawlessly too, but you have already a solution that is guaranteed to work.Abele
@Yakk the incipit of Joseph Mansfield's answer: "A distribution may very well and usually will have some state"Ingenuous
Sure, but 1 and 2 just refer to immutable (up to a new param_type) state, and do not "really" make operator() change the state. The "some internal state" we care about would be the state not dependent on param_type -- the #3 case in your link. All your code would do is make any chained dependency on previous operator() calls be maintained. Simple distributions are not likely to have such dependencies (ie, their reset() is going to do nothing, and two constructed instances with the same param_type are going to be identical).Expulsive
@Abele I can reuse the same distribution if I always need number from the same range. For a changing range I need to change the param_type of the existing distribution (this is like a reset()) or a new distribution. Using the same distribution is possible but complex (see the edit of the question).Ingenuous
@Yakk Thank you. This was also my impression (confirmed checking the gcc implementation of reset()) but Pete Becker's answer has raised some doubts.Ingenuous
E
5

Interesting question.

So I was wondering if interfering with how the distribution works by constantly resetting it (i.e. recreating the distribution at every call of get_int_from_range) I get properly distributed results.

I've written code to test this with uniform_int_distribution and poisson_distribution. It's easy enough to extend this to test another distribution if you wish. The answer seems to be yes.

Boiler-plate code:

#include <random>
#include <memory>
#include <chrono>
#include <utility>

typedef std::mt19937_64 engine_type;

inline size_t get_seed()
    { return std::chrono::system_clock::now().time_since_epoch().count(); }

engine_type& engine_singleton()
{  
    static std::unique_ptr<engine_type> ptr;

    if ( !ptr ) 
        ptr.reset( new engine_type(get_seed()) );
    return *ptr;
}

// ------------------------------------------------------------------------

#include <cmath>
#include <cstdio>
#include <vector>
#include <string>
#include <algorithm>

void plot_distribution( const std::vector<double>& D, size_t mass = 200 )
{
    const size_t n = D.size();
    for ( size_t i = 0; i < n; ++i ) 
    {
        printf("%02ld: %s\n", i, 
            std::string(static_cast<size_t>(D[i]*mass),'*').c_str() );
    }
}

double maximum_difference( const std::vector<double>& x, const std::vector<double>& y )
{
    const size_t n = x.size(); 

    double m = 0.0;
    for ( size_t i = 0; i < n; ++i )
        m = std::max( m, std::abs(x[i]-y[i]) );

    return m;
}

Code for the actual tests:

#include <iostream>
#include <vector>
#include <cstdio>
#include <random>
#include <string>
#include <cmath>

void compare_uniform_distributions( int lo, int hi )
{
    const size_t sample_size = 1e5;

    // Initialize histograms
    std::vector<double> H1( hi-lo+1, 0.0 ), H2( hi-lo+1, 0.0 );

    // Initialize distribution
    auto U = std::uniform_int_distribution<int>(lo,hi);

    // Count!
    for ( size_t i = 0; i < sample_size; ++i )
    {
        engine_type E(get_seed());

        H1[ U(engine_singleton())-lo ] += 1.0;
        H2[ U(E)-lo ] += 1.0;
    }

    // Normalize histograms to obtain "densities"
    for ( size_t i = 0; i < H1.size(); ++i )
    {
        H1[i] /= sample_size; 
        H2[i] /= sample_size; 
    }

    printf("Engine singleton:\n"); plot_distribution(H1);
    printf("Engine creation :\n"); plot_distribution(H2);
    printf("Maximum difference: %.3f\n", maximum_difference(H1,H2) );
    std::cout<< std::string(50,'-') << std::endl << std::endl;
}

void compare_poisson_distributions( double mean )
{
    const size_t sample_size = 1e5;
    const size_t nbins = static_cast<size_t>(std::ceil(2*mean));

    // Initialize histograms
    std::vector<double> H1( nbins, 0.0 ), H2( nbins, 0.0 );

    // Initialize distribution
    auto U = std::poisson_distribution<int>(mean);

    // Count!
    for ( size_t i = 0; i < sample_size; ++i )
    {
        engine_type E(get_seed());
        int u1 = U(engine_singleton());
        int u2 = U(E);

        if (u1 < nbins) H1[u1] += 1.0;
        if (u2 < nbins) H2[u2] += 1.0;
    }

    // Normalize histograms to obtain "densities"
    for ( size_t i = 0; i < H1.size(); ++i )
    {
        H1[i] /= sample_size; 
        H2[i] /= sample_size; 
    }

    printf("Engine singleton:\n"); plot_distribution(H1);
    printf("Engine creation :\n"); plot_distribution(H2);
    printf("Maximum difference: %.3f\n", maximum_difference(H1,H2) );
    std::cout<< std::string(50,'-') << std::endl << std::endl;

}

// ------------------------------------------------------------------------

int main()
{
    compare_uniform_distributions( 0, 25 );
    compare_poisson_distributions( 12 );
}

Run it here.


Does the C++ standard make any guarantee regarding this topic?

Not that I know of. However, I would say that the standard makes an implicit recommendation not to re-create the engine every time; for any distribution Distrib, the prototype of Distrib::operator() takes a reference URNG& and not a const reference. This is understandably required because the engine might need to update its internal state, but it also implies that code looking like this

auto U = std::uniform_int_distribution(0,10);
for ( <something here> ) U(engine_type());

does not compile, which to me is a clear incentive not to write code like this.


I'm sure there are plenty of advice out there on how to properly use the random library. It does get complicated if you have to handle the possibility of using random_devices and allowing deterministic seeding for testing purposes, but I thought it might be useful to throw my own recommendation out there too:

#include <random>
#include <chrono>
#include <utility>
#include <functional>

inline size_t get_seed()
    { return std::chrono::system_clock::now().time_since_epoch().count(); }

template <class Distrib>
using generator_type = std::function< typename Distrib::result_type () >;

template <class Distrib, class Engine = std::mt19937_64, class... Args>
inline generator_type<Distrib> get_generator( Args&&... args )
{ 
    return std::bind( Distrib( std::forward<Args>(args)... ), Engine(get_seed()) ); 
}

// ------------------------------------------------------------------------

#include <iostream>

int main()
{
    auto U = get_generator<std::uniform_int_distribution<int>>(0,10);
    std::cout<< U() << std::endl;
}

Run it here. Hope this helps!

EDIT My first recommendation was a mistake, and I apologise for that; we can't use a singleton engine like in the tests above, because this would mean that two uniform int distributions would produce the same random sequence. Instead I rely on the fact that std::bind copies the newly-created engine locally in std::function with its own seed, and this yields the expected behaviour; different generators with the same distribution produce different random sequences.

Ezana answered 20/5, 2015 at 16:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.