Data Structure(s) Allowing For Alteration Through Iteration and Random Selection From Subset (C++)
Asked Answered
B

1

1

Given a fixed-sized array A of objects, suppose a much smaller subset of these objects meet a certain criteria B. There are three tasks I want to complete with approximately equal frequency:

  1. I want to be able to change an object currently not meeting criteria B to meet criteria B at any time when accessing the object in A by index.
  2. I want to be able to change an object currently meeting criteria B to no longer meet criteria B when accessing the object in A by index.
  3. I also want to be able to choose a random object from ONLY those objects that meet criteria B.

All tasks should be able to be completed in constant time, or as close to constant time as possible, not dependent on the number of objects in A and not dependent on the number of objects meeting criteria B. If constant time isn't possible, which I suspect to be the case, then I want to complete both as quickly as possible, taking into consideration the frequencies I mentioned earlier. What data structure(s) are appropriate for these two tasks if they are being repeated huge numbers of times?

Take, for example, the C++ implementation that I have below. While the timed part (the section of the code that gets repeated a huge number of times) is independent of the overall size of A (alltiles), the time complexity linearly depends on B (bluetiles) (regardless of whether the number of alltiles increases or not), seriously slowing down the code.

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>
#include <algorithm>

using namespace std;

enum color {RED, GREEN, BLUE};
const int NUM_ATTEMPTS = 10000;
const int INITIAL_NUM_BLUE_TILES = 1000;
const int TOTAL_TILES = 1000000;

struct tile
{
  int color = RED;
};

struct room
{
  vector<tile> alltiles;
  vector<tile*> bluetiles;
  room(vector<tile> v) : alltiles(v) {}
};

int main()
{
  srand (time(NULL));

  // set up the initial room, time complexity here is irrelevant
  room myroom(vector<tile>(1*TOTAL_TILES));
  for(int i = 0; i < INITIAL_NUM_BLUE_TILES; i++)
  {
    myroom.alltiles[i].color = BLUE;
    myroom.bluetiles.push_back(&myroom.alltiles[i]);
  }

  auto begin = std::chrono::high_resolution_clock::now();
  for(int attempt_num = 0; attempt_num < NUM_ATTEMPTS; attempt_num++)
  {
    // access a BLUE tile by index from alltiles to change its color to RED
    myroom.alltiles[5].color = RED; // constant time
    myroom.bluetiles.erase(std::remove(myroom.bluetiles.begin(), myroom.bluetiles.end(), &myroom.alltiles[5]), myroom.bluetiles.end()); // linear time, oh no!

    // access a RED tile by index from alltiles to change its color to BLUE
    myroom.alltiles[5].color = BLUE; // constant time
    myroom.bluetiles.push_back(&myroom.alltiles[5]); // constant time

    // randomly choose from ONLY the blue tiles
    int rand_index = rand() % myroom.bluetiles.size(); // constant time
    myroom.bluetiles[rand_index]->color = GREEN; // constant time
    myroom.bluetiles[rand_index]->color = BLUE; // constant time
    // so now I have constant time access to a random blue tile

  }
  auto end = std::chrono::high_resolution_clock::now();
  double runtime = std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count();
  cout << runtime << " ms" << endl; 
  return 0;
}

The portions that are being timed are the operations that I am interested in performing frequently; in the real program, the logic behind choosing which tiles to change differs. Hopefully, a better data structure wouldn't require any probabilistic analysis, but I fear it still might.

I suspect that, perhaps, using double indirection by keeping a pointer (pointing to elements in the bluetiles vector) in the tile class could maybe allow me to achieve this in constant time, but I'm not certain. I guess it could at least speed it up in the sense that searching bluetiles would no longer be necessary, but then removal of an element in bluetiles would still be linear time (since I'm using a vector), so I'm really just not sure what to do here.

Can you devise the fastest data structure to achieve this goal, and provide a C++ implementation building from my example? Or is what I have as good as it will ever get?

Bureaucratize answered 22/8, 2018 at 6:41 Comment(8)
myroom.bluetiles.erase(std::remove(myroom.bluetiles.begin(), myroom.bluetiles.end(), &myroom.alltiles[5]), myroom.bluetiles.end()); is not good code. It is unspecified if the std::remove or the last myroom.bluetiles.end() will be executed first. So this function can behave in 2 different ways. It is specified in C++17, but not in C++11.Pratfall
@Pratfall It is unspecified in C++17, very likely UB prior to that.Matador
Use a hash table for your bluetiles, see if that works for you.Matador
why not using a hashset of integers which keeps indexes of objects meeting criteria B?Apparition
@PasserBy no, it is unspecified behaviour before C++17 and well-defined in C++17: https://mcmap.net/q/23927/-does-this-code-from-quot-the-c-programming-language-quot-4th-edition-section-36-3-6-have-well-defined-behaviorPratfall
I agree with @mangusta, this seems like a database indexing kind of problem. When you add/remove from set B, update the index. The index provides quick lookup at the expense of memory.Incompatible
@Pratfall You seem to have misunderstood. Evaluation of arguments are unsequenced prior C++17, and will very likely be undefined behaviour. They are indeterminately sequenced after C++17, which means they are sequenced one way or another, but unspecified which.Matador
@Incompatible 3rd operation is choose random item from bluetiles. std::unordered_set provides no random access iterator, making this operation linear. On the other hand, this: #12761815.Vignette
C
1

UPDATE: This resembles the solution I propose for the SO question Random element from unordered_set in O(1)

You can implement something like the following SubsetVector<T> class which lets you to insert/remove element from the subset (i.e. mark them) in O(1). Then it lets you find the size of the subset in O(1), and access i-th item from this subset in O(1). I think this is what you wanted. Note that the subset does not guarantee any specific order, but this should be OK for your needs.

The idea is to maintain two vectors.

  1. m_entries that contains the actual data. m_entries[i] contains the element and and an index into m_subset_indices, if the element is in the subset, and -1 otherwise.
  2. m_subset_indices contains all indices of m_entries elements that are in the subset.

Here is the code (compiled but not tested):

template <class T>
class SubsetVector
{
private:
   struct Entry
   {
       T element;
       int index_in_subset = -1;
   };
public:
   explicit SubsetVector(unsigned size = 0) : m_entries(size) 
   {
       m_subset_indices.reserve(size);
   }

   void push_back(const T & element)
   {
       m_entries.push_back(Entry{element, -1});
   }
   const T & operator[](unsigned index) const { return m_entries[index].element; }
   T & operator[](unsigned index) { return m_entries[index].element; }

   void insert_in_subset(unsigned index)
   {
       if (m_entries[index].index_in_subset < 0) {
           m_entries[index].index_in_subset = m_subset_indices.size();
           m_subset_indices.push_back(index);
       }
   }
   void erase_from_subset(unsigned index)
   {
       if (m_entries[index].index_in_subset >= 0) {
           auto subset_index = m_entries[index].index_in_subset;
           auto & entry_to_fix = m_entries[m_subset_indices.back()];
           std::swap(m_subset_indices[subset_index], m_subset_indices.back());
           entry_to_fix.index_in_subset = subset_index;
           m_subset_indices.pop_back();
           m_entries[index].index_in_subset = -1;
       }
   }
   unsigned subset_size() const 
   {
       return m_subset_indices.size();
   }
   T & subset_at(unsigned subset_index)
   {
       auto index = m_subset_indices.at(subset_index);
       return m_entries.at(index).element;
   }
   const T & subset_at(unsigned subset_index) const
   {
       auto index = m_subset_indices.at(subset_index);
       return m_entries.at(index).element;
   }

private:
   std::vector<Entry> m_entries;
   std::vector<unsigned> m_subset_indices;
};
Chewink answered 22/8, 2018 at 9:54 Comment(2)
Perfect! Although I didn't actually use your code, the idea of swapping the elements that I needed to use with the one at the end in combination with double indirection allowed me to achieve all three tasks in constant time! Fantastic idea.Bureaucratize
@ereHsaWyhsipS , yes, it's a great technique. I have been using it for decades. No idea where I have learned it from. Glad it helpedChewink

© 2022 - 2024 — McMap. All rights reserved.