C++ iterate vector randomly

Asked 13/10, 2015 at 8:4 Answered 14/10, 2015 at 10:15

Solved c++ multithreading algorithm vector random-access

I'm working on a multithreaded program where all threads share a read-only vector. The goal of each thread is to walk the entire vector. However, each thread must visit the vector in a different order.

Since the vector is const and shared among all threads, I cannot use random_shuffle and just iterate over it. For now my solution is to build a crossref vector that contains indices into the shared vector and then shuffle this vector, i.e.

     std::vector<int> crossref(SIZE); // SIZE is the size of the shared vector
     std::iota(std::begin(crossref), std::end(crossref), 0); // Fill with the indices 0 .. SIZE-1
     std::mt19937 g(SEED); // each thread has its own seed
     std::shuffle(crossref.begin(), crossref.end(), g); // Shuffle the indices

However, doing this reveals some problems: (1) it is not very efficient, since every thread needs to access its crossref vector before accessing the shared one; (2) I have some performance issues because of the amount of memory required: the shared vector is very big and I have a lot of threads and processors.

Does anyone have some improvement ideas that would avoid the need for extra memory?

Arruda answered 13/10, 2015 at 8:4 Comment(14)

Accessing a std::vector element is O(1), since it's random access. Also, you are not guaranteed that all threads will have a different crossref std::vector, so two threads may happen to iterate over the vector in the same order. – Convolve 13/10, 2015 at 8:7

I would use a single shuffled index stack shared by all threads which is guarded against concurrent access. – Madaras 13/10, 2015 at 8:9

@Convolve - Sure, but the problem is that the shared vector almost fits in the cache, so every time a thread accesses its crossref vector it will invalidate the caches, and this is not efficient. – Arruda 13/10, 2015 at 8:10

I may also have misread the assignment. Apparently each thread must touch each element, and each one in a different order. That necessarily requires some data structure for each thread. Potentially only a bool, but still. – Madaras 13/10, 2015 at 8:11

Do you really need the order to be completely random, or just different for different threads? – Sheilahshekel 13/10, 2015 at 8:11

@PeterSchneider that would seem to conflict with the requirement The goal of each thread is to walk the entire vector. Ah. You found out yourselves :-) – Vallejo 13/10, 2015 at 8:11

Does each thread need to traverse the entire vector, or are they load sharing? – Verst 13/10, 2015 at 8:11

@PeterSchneider that would defeat the purpose of multithreading because each thread needs its own permutation. Every other thread would have to wait while the current thread uses its permutation. – Skipbomb 13/10, 2015 at 8:12

@PeterSchneider a shared index stack is going to incur lock penalties. I should think this will be more costly than separate index vectors. – Verst 13/10, 2015 at 8:13

And the order for each thread not only has to be different, but also has to be generated by a Mersenne Twister? With simple things like different start offsets per thread it would be easier (or "possible"). – Contrition 13/10, 2015 at 8:14

@ Peter Schneider - Your solution is quite elegant, but all threads must visit the whole vector. If I understand your solution correctly, each element of the vector will be visited only once... @ Zereges - It's not a problem if some threads share common subsequences, as long as the whole walk of each thread is different. – Arruda 13/10, 2015 at 8:14

@Skipbomb exactly, I need one permutation per thread! – Arruda 13/10, 2015 at 8:19

Well, if common sequences are not a problem as long as there is something different, just use an offset per thread. With an array of n elements, each thread iterates over the indices x, x+1, x+2 ... n-1, 0, 1, 2 ... x-1, and each thread has a different x. – Contrition 13/10, 2015 at 8:20

@Contrition I don't care about the generator. You can basically see my problem as a graph exploration where all threads have to visit the whole graph (the shared vector) in a different way. With your solution, threads will end up exploring it in the same order (modulo the offset). – Arruda 13/10, 2015 at 8:22

You can use the algebraic notion of primitive root modulo n. Basically

If n is a positive integer, the integers between 1 and n − 1 that are coprime to n form the group of primitive classes modulo n. This group is cyclic if and only if n is equal to 2, 4, p^k, or 2p^k where p^k is a power of an odd prime number

Wikipedia displays how you can generate numbers below 7 using 3 as generator.

From this statement you derive an algorithm.

1. Take the size n of your vector.
2. Find the smallest prime number m that is bigger than n.
3. For each of your threads, pick a distinct F(0) between 2 and m - 1 that is a primitive root modulo m.
4. Compute the next value using F(i+1) = (F(i) * F(0)) mod m. If F(i) is within the [1, n] range, access element F(i) - 1 of the vector. If not, move on to the next value.
5. Stop after m - 1 iterations (or when you obtain 1; it is the same thing).

Because m is prime, the multiplicative group modulo m is cyclic, so primitive roots exist, and each of them generates the whole set {1 ... m - 1}. Not every number between 2 and m - 1 is a primitive root (for m = 7, 3 is one, but 2 only generates 1, 2, 4), so each candidate for F(0) has to be checked. With a primitive root, you are guaranteed that no number repeats in the first m - 1 steps and that all m - 1 numbers appear.

Complexity:

Step 2: Done once; complexity equivalent to finding primes up to n, i.e. a sieve of Eratosthenes.
Step 3: Done once; try the candidates 2, 3, 4, 5, etc. and keep those that are primitive roots. You need one per thread.
Step 4: O(m) time, O(1) space per thread. You don't need to store the F(i); you only need to know the first value and the last value. These are the same properties as plain incrementing.
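The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the names `is_prime`, `next_prime`, `pow_mod`, `is_primitive_root` and `for_each_permuted` are made up for the example, trial division is used instead of a sieve for brevity, and m is assumed to fit in 32 bits so that the products stay below 2^64. Note that F(0) has to be a primitive root modulo m (for m = 7, 3 works, while 2 only cycles through 1, 2, 4), so the sketch checks that before walking.

```cpp
#include <cassert>
#include <cstdint>

// Trial-division primality test; good enough for a one-off setup step.
bool is_prime(std::uint64_t x) {
    if (x < 2) return false;
    for (std::uint64_t d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

// Smallest prime strictly greater than n.
std::uint64_t next_prime(std::uint64_t n) {
    std::uint64_t m = n + 1;
    while (!is_prime(m)) ++m;
    return m;
}

// (b^e) mod m by repeated squaring; assumes m < 2^32.
std::uint64_t pow_mod(std::uint64_t b, std::uint64_t e, std::uint64_t m) {
    std::uint64_t r = 1 % m;
    b %= m;
    while (e > 0) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
        e >>= 1;
    }
    return r;
}

// For a prime m, g is a primitive root iff g^((m-1)/q) != 1 (mod m)
// for every prime factor q of m - 1.
bool is_primitive_root(std::uint64_t g, std::uint64_t m) {
    if (g % m == 0) return false;
    const std::uint64_t phi = m - 1;
    std::uint64_t rest = phi;
    for (std::uint64_t q = 2; q * q <= rest; ++q) {
        if (rest % q != 0) continue;
        if (pow_mod(g, phi / q, m) == 1) return false;
        while (rest % q == 0) rest /= q;
    }
    if (rest > 1 && pow_mod(g, phi / rest, m) == 1) return false;
    return true;
}

// Calls visit(index) exactly once for every index in [0, n),
// in the order induced by the generator g, using O(1) extra memory.
template <class Visit>
void for_each_permuted(std::uint64_t n, std::uint64_t g, Visit visit) {
    const std::uint64_t m = next_prime(n);
    assert(m < (std::uint64_t{1} << 32));  // keeps f * g0 below 2^64
    assert(is_primitive_root(g, m));
    const std::uint64_t g0 = g % m;
    std::uint64_t f = g0;                  // F(0)
    for (std::uint64_t i = 0; i + 1 < m; ++i) {
        if (f <= n) visit(f - 1);          // values 1..n map to indices 0..n-1
        f = f * g0 % m;                    // F(i+1) = F(i) * F(0) mod m
    }
}
```

A thread would call something like `for_each_permuted(shared.size(), g, [&](std::uint64_t i) { use(shared[i]); })` with its own primitive root `g`; `shared` and `use` are placeholders here.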

Skipbomb answered 13/10, 2015 at 9:24 Comment(4)

Very elegant solution! – Vallejo 13/10, 2015 at 9:49

Exactly what I was thinking about... In cryptography this group is used a lot for asymetric encryption and signature algorithms, where we also want a pseudo-random permutation which should be completely different for each "key" (here F(0)). – Hume 13/10, 2015 at 11:34

Thank you, this is really elegant! – Arruda 13/10, 2015 at 19:23

There's something I'm not getting. Assume n to be 6, then m would be 7. If I now pick 2 as my F(0), I'll be only generating 1,2,4, since I'm always multiplying these by 2. What am I missing here? – Henson 29/3, 2019 at 18:36

If I understand correctly, you want to generate a random permutation in an incremental way, i.e. you want to call a function f n times so that it generates all the permuted numbers from 1 to n, and so that the function uses constant memory.

I doubt such a function exists if you want a uniform distribution over all permutations, but you may be satisfied with a subset of the set of permutations.

If this is the case, you can generate a permutation by taking a number p coprime to n and calculating, for each i in [1,n], i·p (mod n). For example, if you have n=5 and p=7, then 7%5=2, 14%5=4, 21%5=1, 28%5=3, 35%5=0. You may combine several such functions to obtain something satisfactory for you...
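A minimal sketch of this idea, assuming C++17 for `std::gcd`; the name `for_each_strided` is invented for the example. It keeps a running index instead of multiplying, which yields the same sequence i·p (mod n) for i = 1..n without any risk of overflow in the product.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>  // std::gcd (C++17)

// Calls visit(index) once for every index in [0, n), in the order
// p mod n, 2p mod n, ..., np mod n. Requires gcd(n, p) == 1.
template <class Visit>
void for_each_strided(std::size_t n, std::size_t p, Visit visit) {
    assert(n > 0 && std::gcd(n, p) == 1);
    const std::size_t step = p % n;
    std::size_t idx = 0;
    for (std::size_t i = 0; i < n; ++i) {
        idx += step;            // idx < 2n here, since idx < n and step < n
        if (idx >= n) idx -= n;
        visit(idx);
    }
}
```

With n=5 and p=7 this visits 2, 4, 1, 3, 0, matching the example above. Each thread would use its own p; as noted in the comments, the resulting orders are different but not very random.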

Tver answered 13/10, 2015 at 8:36 Comment(6)

You mean that if every thread has its own different p coprime to n, every thread can iterate over the whole vector with its own permutation just by computing i·p (mod n) for each i in [1,n]? If this is the case, I can easily precompute a set of p values offline. Do I understand correctly? – Arruda 13/10, 2015 at 8:47

Yes, that's it. And if you think such a function does not mix things up sufficiently, then combine it with an offset starting point, or compute it twice with two different primes. You can also go backward from n to 1, etc. Some combination of these may fit your needs well. – Ammeter 13/10, 2015 at 8:54

Jean That will not work if the prime number p divides n. – Skipbomb 13/10, 2015 at 9:28

I suspect this will result in weak randomness. However, OP didn't say something about the quality of the randomness, only that the permutation has to be different among the threads, so this answer satisfies that. – Hume 13/10, 2015 at 11:36

Yes, it provides a poor randomness. – Ammeter 13/10, 2015 at 12:14

The function that @Arruda needs cannot be a good random generator: it loses part of its randomness each time it generates a sample. In effect, the last sample has a conditional probability equal to 1. – Playreader 15/10, 2015 at 14:0

If memory is your biggest problem then you'll have to swap CPU cycles for memory space.

E.g. c++'s std::vector<bool> (http://en.cppreference.com/w/cpp/container/vector_bool) is a bit-array so quite memory efficient.

Each thread could have its own vector<bool> indicating whether or not it has visited a particular index. Then you'd have to use CPU cycles to randomly choose an index that it hasn't visited yet, and terminate when all bools are true.
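One way to read this suggestion is plain rejection sampling: draw random indices and skip the ones already marked. The sketch below is an assumption about how that could look, with the invented name `visit_random_rejection`. It uses one bit per element per thread, and the last few unvisited indices take many draws to hit, which is the slowdown discussed in the comments.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Calls visit(index) once for every index in [0, n), in random order,
// using a per-thread bit array instead of a shuffled index vector.
template <class Visit>
void visit_random_rejection(std::size_t n, unsigned seed, Visit visit) {
    if (n == 0) return;
    std::vector<bool> visited(n, false);  // one bit per element
    std::mt19937 gen(seed);               // each thread uses its own seed
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::size_t remaining = n;
    while (remaining > 0) {
        const std::size_t i = pick(gen);
        if (visited[i]) continue;         // already seen: draw again
        visited[i] = true;
        --remaining;
        visit(i);
    }
}
```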

Vallejo answered 13/10, 2015 at 8:20 Comment(6)

And how is searching an unsorted bool vector for "false" n times, i.e. O(n^2), instead of n single array accesses, going to make anything faster? – Contrition 13/10, 2015 at 8:21

Well, OP explicitly said that memory was his main problem. You can't have both space efficiency and CPU efficiency. – Vallejo 13/10, 2015 at 8:25

Note OP's question: "Does anyone has some improvement ideas that will avoid the need of extra memory?" – Vallejo 13/10, 2015 at 8:26

Ok, point for you. And yes, making it faster and less memory-consuming seems pretty much impossible. – Contrition 13/10, 2015 at 8:27

I'm ready to be less CPU efficient if each thread can have its own permutation with a small amount of extra memory... – Arruda 13/10, 2015 at 8:29

Clearly the problem is with the last few iterations. You could improve the performance of, let's say, the last 10% of iterations by switching to an index-based version. When reaching the limit, collect the indices of all false entries, permute them and use them as an index list. Compared to the original algorithm, this preserves the O(n) time but still reduces memory consumption. However, the timeline will be less "contiguous"; when switching the algorithm, the thread pauses its work for some significant time. This may be imperfect, depending on the application. – Hume 13/10, 2015 at 8:42

It seems this guy solved your problem in a very nice way.

This is what he says in the first line of the post: In this post I’m going to show a way to make an iterator that will visit items in a list in a random order, only visit each item once, and tell you when it’s visited all items and is finished. It does this without storing a shuffled list, and it also doesn’t have to keep track of which items it has already visited.

He leverages the power of a variable bit-length block cipher algorithm to generate each and every index in the array.
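To illustrate the general technique (this is not the code from the linked post), here is a rough sketch of a small Feistel network over just enough bits to cover the vector, combined with "cycle walking": if the cipher output falls outside [0, n), it is encrypted again until it lands inside. The class name `FeistelPermutation`, the choice of 4 rounds and the integer mixer used as round function are all assumptions made for the example; n is assumed to be at most 2^32.

```cpp
#include <cassert>
#include <cstdint>

// Maps i in [0, n) to a distinct j in [0, n) using O(1) memory.
class FeistelPermutation {
public:
    FeistelPermutation(std::uint64_t n, std::uint32_t seed)
        : n_(n), seed_(seed), half_bits_(1) {
        assert(n >= 1 && n <= (std::uint64_t{1} << 32));
        // Smallest even bit width 2 * half_bits_ whose range covers n.
        while (half_bits_ < 16 && (std::uint64_t{1} << (2 * half_bits_)) < n)
            ++half_bits_;
        half_mask_ = (std::uint32_t{1} << half_bits_) - 1;
    }

    std::uint64_t operator()(std::uint64_t i) const {
        assert(i < n_);
        std::uint64_t v = encrypt(i);
        while (v >= n_) v = encrypt(v);  // cycle walking back into [0, n)
        return v;
    }

private:
    // Arbitrary integer mixer standing in for the hash used in the post.
    std::uint32_t round_fn(std::uint32_t x, std::uint32_t round) const {
        std::uint32_t h = x + seed_ + round * 0x9E3779B9u;
        h ^= h >> 16; h *= 0x85EBCA6Bu;
        h ^= h >> 13; h *= 0xC2B2AE35u;
        h ^= h >> 16;
        return h & half_mask_;
    }

    // Balanced Feistel network: a bijection on [0, 2^(2 * half_bits_))
    // whatever round_fn computes.
    std::uint64_t encrypt(std::uint64_t v) const {
        std::uint32_t left = static_cast<std::uint32_t>(v >> half_bits_) & half_mask_;
        std::uint32_t right = static_cast<std::uint32_t>(v) & half_mask_;
        for (std::uint32_t r = 0; r < 4; ++r) {
            const std::uint32_t next = left ^ round_fn(right, r);
            left = right;
            right = next;
        }
        return (std::uint64_t{left} << half_bits_) | right;
    }

    std::uint64_t n_;
    std::uint32_t seed_;
    std::uint32_t half_bits_;
    std::uint32_t half_mask_;
};
```

Each thread would build its own `FeistelPermutation perm(shared.size(), seed)` with a different seed and loop `i` from 0 to n - 1, reading `shared[perm(i)]`; `shared` and `seed` are placeholders.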

Playreader answered 14/10, 2015 at 10:15 Comment(2)

Indeed, and it seems that the permutations are better than in the previous answers because it mixes algebra and murmurhash. Am I right? – Arruda 14/10, 2015 at 14:25

It uses murmurhash as the round function in the Feistel network iterations, but he also states that you can use any other kind of function. That hash just gave him good results. – Playreader 14/10, 2015 at 15:37

This is not a complete answer but it should lead us to a correct solution.

You have written some things which we could take as assumptions:

(1) it is not very efficient since every thread needs to access its crossref vector before accessing the shared one,

This is unlikely to be true. We're talking about one indirect lookup. Unless your reference data is really a vector of ints, this will represent an infinitesimal part of your execution time. If your reference data is a vector of ints, then just make N copies of it and shuffle them...

(2) I have some performance issues because of the amount of memory required: the shared vector is very big and I have a lot of threads and processors.

How big? Did you measure it? How many discrete objects are there in the vector? How big is each one?

How many threads?

How many processors?

How much memory do you have?

Have you profiled the code? Are you sure where the performance bottleneck is? Have you considered a more elegant algorithm?

Verst answered 13/10, 2015 at 10:42 Comment(4)

I'm up to 128 threads with 128 dedicated cores. I work with a vector of int that doesn't fill the whole memory, but it's a fair simplification of my problem. I've used a profiler and I'm sure that this part is a bottleneck. I also use a dedicated memory allocator. "Have you considered a more elegant algorithm?" Yes, but the algorithm is elegant; the problem was about coding this elegant algorithm efficiently. – Arruda 13/10, 2015 at 19:17

Do the values in the vector need to be ints? Could they be shorts? – Verst 14/10, 2015 at 0:6

No, they can't... I am also investigating more elaborate solutions with range compression, but currently I'm still at the experimentation stage. – Arruda 14/10, 2015 at 8:59

Difficult to comment further without knowing more about the problem domain. – Verst 14/10, 2015 at 9:2
