A parallel algorithm for order-preserving selection from an index table
Order-preserving selection from an index table is trivial in serial code, but it is less straightforward to do multi-threaded, in particular if one wants to retain efficiency (the whole point of multi-threading) by avoiding linked lists. Consider the serial code:

template<typename T>
std::vector<T> select_in_order(
  std::vector<std::size_t> const& keys, // permutation of 0 ... keys.size()-1
  std::vector<T> const& data)           // anything copyable
{ // select data[keys[i]], allowing keys.size() >= data.size()
  std::vector<T> result;
  for(auto key : keys)
    if(key < data.size())
      result.push_back(data[key]);
  return result;
}

How can I do this multi-threaded (say, using TBB or even OpenMP), in particular if data.size() < keys.size()?

Alcoholism answered 13/6, 2013 at 14:21 Comment(0)
The parallel computing operation you're looking for is called Stream Compaction.

It can be implemented efficiently in parallel, though the algorithm is non-trivial. Your best bet is to use a library that already implements it, such as Thrust. If you truly want to implement it yourself, though, an explanation of the algorithm can be found in GPU Gems 3, Chapter 39.3.1, or alternatively in Udacity's Intro to Parallel Programming course, Lesson 4.5.

Essentially, it involves defining a boolean predicate for your array (in your example, key < data.size()), mapping it to a separate array, taking a Scan (prefix sum) over the predicate array, then doing a Scatter.

Map() and Scatter() are easy to implement in parallel; the implementation of Scan() is the non-trivial part. Most parallel libraries will have a Scan() implementation; if not, the above links both describe several parallel scan algorithms.
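To make the three phases concrete, here is a minimal serial sketch (the `compact` name is illustrative, and the loops are written serially; each marked loop is exactly what you would hand to a parallel map, scan, or scatter primitive):

```cpp
#include <cstddef>
#include <vector>

// Stream compaction via map / scan / scatter (serial sketch; each
// phase parallelizes independently of the others).
template<typename T>
std::vector<T> compact(std::vector<std::size_t> const& keys,
                       std::vector<T> const& data)
{
  std::size_t const n = keys.size();

  // 1. Map: evaluate the predicate for every key.
  std::vector<std::size_t> flag(n);
  for(std::size_t i = 0; i < n; ++i)        // -> parallel map
    flag[i] = keys[i] < data.size();

  // 2. Exclusive scan: offset[i] = sum of flag[0 .. i-1].
  std::vector<std::size_t> offset(n);
  std::size_t sum = 0;
  for(std::size_t i = 0; i < n; ++i) {      // -> parallel scan
    offset[i] = sum;
    sum += flag[i];
  }

  // 3. Scatter: each selected element lands at its scanned offset,
  //    which preserves the original key order.
  std::vector<T> result(sum);
  for(std::size_t i = 0; i < n; ++i)        // -> parallel scatter
    if(flag[i])
      result[offset[i]] = data[keys[i]];
  return result;
}
```

The exclusive scan is what makes the output order-preserving: selected element i knows how many selected elements precede it, so all threads can write without coordination.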


This is all assuming you have many cores, like on a GPU. On a CPU, it would probably just be faster to do it serially; or to divide the array into large chunks, process the chunks serially (on different cores in parallel), and merge the results back together. Which approach is best depends on your data (the former works better if most keys are expected to be in the final array).

Fidelafidelas answered 13/6, 2013 at 16:12 Comment(2)
+1 for all the references (though they are a bit too heavy on GPUs) and for mentioning the stream compaction algorithm. – Alcoholism
Note that tbb also has parallel_scan(). – Alcoholism
Partition your keys among your threads: with N threads, give T1 the keys [0, keys.size()/N), T2 the keys [keys.size()/N, 2*keys.size()/N), and so on, with TN getting the keys [(N-1)*keys.size()/N, keys.size()). Each thread puts its results in a thread-local container, and you merge the containers when the threads have finished. This way you don't need to perform any synchronization between the threads.

The most efficient way to merge the containers is for the containers to be linked lists, since it's easy to append T2's list to T1's list and so on. However, as you said it's a good idea to avoid linked lists since they don't parallelize well.

Another option is to have each thread store its results in a thread-local array, and then merge these arrays when the threads have completed; this merge can itself be done in parallel. Given the final array final_results, T1 copies its data to final_results[0, T1results.size()-1], T2 to final_results[T1results.size(), T1results.size() + T2results.size()-1], T3 to final_results[T1results.size() + T2results.size(), T1results.size() + T2results.size() + T3results.size()-1], and so on: each thread's write offset is the prefix sum of the preceding threads' result sizes.
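A sketch of this partitioned approach, using std::thread for portability (the function name and the default thread count are illustrative; with TBB or OpenMP the outer loop would be a parallel_for over contiguous blocks instead). Thread t filters its contiguous key range into its own vector, and the per-thread vectors are concatenated in thread order, so no synchronization is needed and the original order is preserved:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Partitioned order-preserving selection: thread t handles the
// contiguous key range [t*n/N, (t+1)*n/N).
template<typename T>
std::vector<T> select_in_order_parallel(
    std::vector<std::size_t> const& keys,
    std::vector<T> const& data,
    unsigned N = std::thread::hardware_concurrency())
{
  if(N == 0) N = 1;
  std::size_t const n = keys.size();
  std::vector<std::vector<T>> local(N);   // one result vector per thread
  std::vector<std::thread> threads;
  for(unsigned t = 0; t < N; ++t)
    threads.emplace_back([&, t] {
      std::size_t const begin = t * n / N, end = (t + 1) * n / N;
      for(std::size_t i = begin; i < end; ++i)
        if(keys[i] < data.size())
          local[t].push_back(data[keys[i]]);
    });
  for(auto& th : threads) th.join();

  // Merge: concatenate thread-local results in thread order. This loop
  // is serial here, but could copy in parallel using the prefix sums of
  // the local sizes as write offsets, as described above.
  std::vector<T> result;
  for(auto const& v : local)
    result.insert(result.end(), v.begin(), v.end());
  return result;
}
```

Because the blocks are contiguous and concatenated in thread order, the output is identical to the serial select_in_order for any thread count.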

Another option is to use a shared concurrent_hash_map from TBB, with key as the key and data[key] as the value.

Interfuse answered 13/6, 2013 at 15:0 Comment(5)
Hmmm, I have to ponder these options. Definitely some ideas here I haven't thought about. – Alcoholism
How would I implement your approach using tbb? I cannot use parallel_for(), because then each thread will have a random collection of keys to work on, not an ordered block. – Alcoholism
@Alcoholism Use a nested loop: the outer loop is a parallel_for loop i = 0 to keys.size() incrementing by keys.size()/N (where N is the number of threads you want), and the inner loop is a standard for loop j = i to (i + keys.size()/N) incrementing by 1. Use an array of arrays to store the results: an array of length N of arrays of length keys.size()/N. Use a second counter in the outer loop k = 0 to N, and have the inner loops store their results in the array indexed at arrays[k]. To merge the result arrays you'd just need a single parallel_for loop. – Mcginty
@Alcoholism You can use an array of vectors instead, and instead of a second counter you can use arrays[i / (keys.size() / N)]. – Mcginty
How can I control the number N of threads used by parallel_for()? – Alcoholism

© 2022 - 2024 — McMap. All rights reserved.