Random element from unordered_set in O(1)
Asked Answered
I

4

20

I've seen people mention that a random element can be grabbed from an unordered_set in O(1) time. I attempted to do so with this:

std::unordered_set<TestObject*> test_set;

//fill with data

size_t index = rand() % test_set.size();
const TestObject* test = *(test_set.begin() + index);

However, unordered_set iterators don't support + with an integer. begin can be given a size_t param, but it is the index of a bucket rather than an element. Randomly picking a bucket then randomly picking an element within it would result in a very unbalanced random distribution.

What's the secret to proper O(1) random access? If it matters, this is in VC++ 2010.

Inconsequent answered 6/10, 2012 at 15:58 Comment(8)
If the set is unordered, then the element pointed to by begin() is random isn't it?Unilingual
Somewhat, but it will be the same element each time. I'm also not sure how random it really is.Inconsequent
@nikhil: the "order" in which elements are stored in an unordered set is very likely to be deterministic, not random. (i.e. if you construct "the same" set twice, you're very likely to get the same "order")Romalda
Any references for "I've seen people mention..."?Berceuse
Here's one mention in the 2nd comment to OP: #8302731Inconsequent
This can be done, but not with an unordered_set by itself. You'll need to maintain an auxiliary vector of the items in the set.Berceuse
@Unilingual - "I don't know how to figure it out" is not the same as random.Faldstool
Possible duplicate of How to select a random element in std::set in less than O(n) time?Bunsen
A
8

I believe you have misinterpreted the meaning of "random access", as it was used in those cases you're referring to.

"Random access" doesn't have anything to do with randomness. It means to access an element "at random", i.e. access any element anywhere in the container. Accessing an element directly, such as with std::vector::operator[] is random access, but iterating over a container is not.

Compare this to RAM, which is short for "Random Access Memory".

Achievement answered 6/10, 2012 at 16:11 Comment(1)
I linked one comment specifically in the context of finding a random element. Of course, it's not like a random SO comment is gospel, but I was wondering if it could be done. Also, I'm pretty sure you can pick a random element out of RAM if you know its size, so I dunno if it's a good example. I can see how something could be random access but difficult/impossible to perform O(1) random element grabbing, though. But which is the case for unordered_set?Inconsequent
G
11

std::unordered_set has no O(1) random access in the sense of an array. It is possible to access an element, based on key, in O(1) but it is impossible to find the k-th element.

Despite that, here is a way to get a random element with a uniform distribution from std::unordered_map (or with std::unordered_set if the key has a mutable field). I have laid out a similar technique in an answer to SO question Data Structure(s) Allowing For Alteration Through Iteration and Random Selection From Subset (C++).

The idea is to supplement each entry in std::unordered_set with a mutable index value into a vector of pointers into the unordered_set. The size of the vector is the size of the unordered_set. Every time a new element is inserted to the unordered_set, a pointer to that element is push_back-ed into the vector. Every time an element is erased from the unordered_set, the corresponding entry in the vector is located in O(1), and is swapped with the back() element of the vector. The index of the previously back() element is amended, and now points to its new location in the vector. Finally the old entry is pop_back()-ed from the vector.

This vector points exactly to all elements in the unordered_set. It takes O(1) to pick a random element from the combined structure in uniform distribution. It takes O(1) to add or erase an element to the combined structure.

NOTE: Pointers to elements (unlike iterators) are guaranteed to stay valid as long as the element exists.

Here is how this should look: three elements in the set

For erasing element c:

  1. swap element c_index and a_index and fix the pointers to them:
  2. pop_back last element, which is element_c from the vector.
  3. erase c from the unordered_set.

Randomization is trivial - simply select an element at random from the vector.

EDIT: Here is a partial code that can return a uniformly-distributed random element from an unordered_set. I had to do some things slightly different than in my explanations above, since there is no reliable indexing (or iterators) into unordered_set. The thing that makes it impossible to hold iterators into the unordered_set is that its elements are getting rehashed from time to time, invalidating all iterators in the process. So, instead of stable indexing, this solution is simply using pointers into an object that is never reallocated:

#include <unordered_set>
#include <functional>
#include <vector>
#include <memory>
#include <random>


template <class T>
class RandomUnorderedSet
{
private:
   struct Entry {
       Entry(const T & data_in, unsigned index_in_vector_in)
       : data(data_in), index_in_vector(index_in_vector_in) 
       {}
       T data;
       unsigned index_in_vector;
   };
   struct PtrEntryHash {
       auto operator()(const std::unique_ptr<Entry> & entry) const 
       { 
           return std::hash<T>()(entry->data);
       }
   };
   struct PtrEntryEqual {
       bool operator()(const std::unique_ptr<Entry> & a, 
                       const std::unique_ptr<Entry> & b ) const 
       { 
           return a->data == b->data;
       }
   };
public:
   bool insert(const T & element)
   {
       auto entry_ptr = std::make_unique<Entry>(element, m_entry_vector.size());
       if (m_entry_set.count(entry_ptr) > 0)
          return false;
       m_entry_vector.push_back(entry_ptr.get());
       try {
            m_entry_set.insert(std::move(entry_ptr));
       } catch(...) {
           m_entry_vector.pop_back();
           throw;
       }
       return true;
   }

   // Return the number of elements removed
   int erase(const T & element)
   {
       auto it = m_entry_set.find(element);
       if (it == m_entry_set.end())
          return 0;
       auto swap_with = it->index_in_vector;
       if (swap_with < m_entry_vector.size() - 1) {
           m_entry_vector.back()->index_in_vector = swap_with;
           m_entry_vector[swap_with] = m_entry_vector.back();
       }
       m_entry_set.erase(it);
       m_entry_vector.pop_back();
       return 1;
   }
   template <typename RandomGenerator>
   const T & random_element(RandomGenerator & r)
   {
       std::uniform_int_distribution<> dis(0, m_entry_vector.size() - 1);
       return m_entry_vector[dis(r)]->data;

   }

private:
   std::unordered_set<std::unique_ptr<Entry>, PtrEntryHash, PtrEntryEqual> 
        m_entry_set;
   std::vector<Entry*> m_entry_vector;
};

Notes:

  • This implementation is just a skeleton, where additional operations might be added.
  • If this is to be a library class, then it is best to make it a proper container, with an iterator type, which hides the implementation details, and with begin() and end() calls, and with a better return type for insert().
Gorgeous answered 22/8, 2018 at 19:27 Comment(3)
"Every time an element is erased from the unodrered_set, the corresponding entry in the vector is located in O(1)", can you explain how would you locate it in O(1) in the vector?Jenjena
@Jenjena each element in the set maintains its index in the array, so getting the index is a matter of a trivial lookupGorgeous
@Jenjena I have added a skeleton implementation to the answer, you can look for details there.Gorgeous
A
8

I believe you have misinterpreted the meaning of "random access", as it was used in those cases you're referring to.

"Random access" doesn't have anything to do with randomness. It means to access an element "at random", i.e. access any element anywhere in the container. Accessing an element directly, such as with std::vector::operator[] is random access, but iterating over a container is not.

Compare this to RAM, which is short for "Random Access Memory".

Achievement answered 6/10, 2012 at 16:11 Comment(1)
I linked one comment specifically in the context of finding a random element. Of course, it's not like a random SO comment is gospel, but I was wondering if it could be done. Also, I'm pretty sure you can pick a random element out of RAM if you know its size, so I dunno if it's a good example. I can see how something could be random access but difficult/impossible to perform O(1) random element grabbing, though. But which is the case for unordered_set?Inconsequent
K
6

std::unordered_set don't provide a random access iterator. I guess it's a choice from the stl designers to give stl implementers more freedom...the underlying structure have to support O(1) insertion and deletion but don't have to support random access. For example you can code an stl-compliant unordered_set as a doubly linked list even though it's impossible to code a random access iterator for such an underlying container.

Getting a perfectly random element is then not possible even though the first element is random because the way the elements are sorted by hash in the underlying container is deterministic...And in the kind of algorithm that I am working on, using the first element would skew the result a lot.

I can think of a "hack", if you can build a random value_type element in O(1)... Here is the idea :

  1. check the unordered set in not empty (if it is, there is no hope)
  2. generate a random value_type element
  3. if already in the unordered set return it else insert it
  4. get an iterator it on this element
  5. get the random element as *(it++) (and if *it is the last element the get the first element)
  6. delete the element you inserted and return the value in (5)

All these operations are O(1). You can implement the pseudo-code I gave and templatize it quite easily.

N.B : The 5th step while very weird is also important...because for example if you get the random element as it++ (and it-- if it is the last iterator) then the first element would be twice less probable than the others (not trivial but think about it...). If you don't care about skewing your distribution that's okay you can just get the front element.

Kisor answered 20/7, 2015 at 17:27 Comment(2)
Would this always be random?Delanty
@DonCode I will skew the distribution a lot because the probability of picking one element is directly proportional to the number of empty buckets (+1) before it. I have coded a better solution here: gist.github.com/matovitch/1c262d2d870024ca0b81Kisor
M
1

I wrote a solution using buck_count() and cbegin(n) methods, to choose a bucket at random, and then choose an element at random in the bucket.

Two problems: - this is not constant time (worse case with a lot of empty buckets and many elements in one bucket) - the probability distribution is skewed

I think that the only way to peek an element at random is to maintain a separate container providing a random access iterator.

#include <random>
#include <iostream>
#include <unordered_set>
#include <unordered_map>
#include <cassert>

using namespace std;

ranlux24_base randomEngine(5);

int rand_int(int from, int to)
{
    assert(from <= to);

    return uniform_int_distribution<int>(from, to)(randomEngine);
}

int random_peek(const unordered_set<int> & container)
{
    assert(container.size() > 0);

    auto b_count = container.bucket_count();
    auto b_idx = rand_int(0, b_count - 1);
    size_t b_size = 0;

    for (int i = 0; i < b_count; ++i)
    {
        b_size = container.bucket_size(b_idx);
        if (b_size > 0)
            break;

        b_idx = (b_idx + 1) % b_count;
    }

    auto idx = rand_int(0, b_size - 1);

    auto it = container.cbegin(b_idx);

    for (int i = 0; i < idx; ++i)
    {
        it++;
    }

    return *it;
}

int main()
{
    unordered_set<int> set;

    for (int i = 0; i < 1000; ++i)
    {
        set.insert(rand_int(0, 100000));
    }

    unordered_map<int,int> distribution;

    const int N = 1000000;
    for (int i = 0; i < N; ++i)
    {
        int n = random_peek(set);
        distribution[n]++;
    }

    int min = N;
    int max = 0;

    for (auto & [n,count]: distribution)
    {
        if (count > max)
            max = count;
        if (count < min)
            min = count;
    }

    cout << "Max=" << max << ", Min=" << min << "\n";
    return 0;
}
Microscopy answered 15/11, 2017 at 10:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.