How to parallelize std::partition using TBB

Does anyone have any tips for efficiently parallelizing std::partition using TBB? Has this been done already?

Here is what I'm thinking:

  1. if the array is small, std::partition it (serial) and return
  2. else, treat the array as 2 interleaved arrays using custom iterators (interleave in cache-sized blocks)
  3. start a parallel partition task for each pair of iterators (recurse to step 1)
  4. swap elements between the two partition/middle pointers*
  5. return the merged partition/middle pointer

*I am hoping that, in the average case, this region will be small compared to the length of the array, or compared to the swaps required when partitioning the array in contiguous chunks.
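
For concreteness, here is a minimal sketch of one level of this idea. It interleaves by single elements rather than cache-sized blocks, does one parallel level instead of recursing in step 3, and falls back to serial std::partition for odd lengths; stride2_iterator and interleaved_partition are names made up for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <tbb/parallel_invoke.h>

// Forward iterator that walks one of the two interleaved "arrays" over a
// plain buffer. It is index-based so the past-the-end iterator never forms
// an out-of-bounds pointer.
template <typename T>
struct stride2_iterator {
    using iterator_category = std::forward_iterator_tag;
    using value_type = T;
    using difference_type = std::ptrdiff_t;
    using pointer = T*;
    using reference = T&;
    T* base;            // first element of this stream
    std::ptrdiff_t i;   // logical index within the stream
    reference operator*() const { return base[2 * i]; }
    stride2_iterator& operator++() { ++i; return *this; }
    stride2_iterator operator++(int) { auto t = *this; ++i; return t; }
    bool operator==(const stride2_iterator& o) const { return i == o.i; }
    bool operator!=(const stride2_iterator& o) const { return i != o.i; }
};

// Partition the even- and odd-indexed elements in parallel (step 3), then
// clean up the region between the two middles (step 4): everything before
// 2*min(ke,ko) is already "true" and everything from 2*max(ke,ko) on is
// already "false", so one small serial std::partition finishes the job.
template <typename T, typename Pred>
T* interleaved_partition(T* first, T* last, Pred pred) {
    const std::ptrdiff_t n = last - first;
    if (n < 2 || n % 2 != 0)                 // odd sizes: serial, for brevity
        return std::partition(first, last, pred);
    const std::ptrdiff_t half = n / 2;       // size of each stream
    std::ptrdiff_t ke = 0, ko = 0;           // "true" counts per stream
    tbb::parallel_invoke(
        [&] { ke = std::partition(stride2_iterator<T>{first, 0},
                                  stride2_iterator<T>{first, half}, pred).i; },
        [&] { ko = std::partition(stride2_iterator<T>{first + 1, 0},
                                  stride2_iterator<T>{first + 1, half}, pred).i; });
    std::partition(first + 2 * std::min(ke, ko),
                   first + 2 * std::max(ke, ko), pred);
    return first + ke + ko;                  // merged middle (step 5)
}
```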

Any thoughts before I try it?

Disown answered 28/5, 2014 at 23:21 Comment(8)
If you're using gcc, you could start with -fopenmp and defining _GLIBCXX_PARALLEL to use the parallel versions of the standard library that have already been written, tested, etc. If you only want to run std::partition in parallel, you can include parallel/algorithm and call __gnu_parallel::partition.Caston
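
A minimal usage sketch of that suggestion (GCC only; the data and predicate are just placeholders):

```cpp
// Compile with: g++ -std=c++11 -fopenmp example.cpp
// Defining _GLIBCXX_PARALLEL instead swaps the std:: algorithms themselves
// for their parallel versions where profitable.
#include <parallel/algorithm>   // GCC extension; not available in Clang
#include <vector>

int main() {
    std::vector<int> v{5, 1, 8, 2, 9, 3};
    auto mid = __gnu_parallel::partition(v.begin(), v.end(),
                                         [](int x) { return x < 5; });
    (void)mid;   // iterator to the first element of the second group
}
```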
This needs to compile with gcc and clang. Also, the code doing the partition will already be running in a TBB task; will the OpenMP scheduler and the TBB scheduler play nicely together?Disown
I'd expect them to have put some effort into making them play nicely together, but I haven't tested it, so it's hard to be sure.Caston
No, OpenMP usually creates quadratic oversubscription if called inside an outermost parallel loop (of TBB or something else); that follows from the OpenMP specification. The level of oversubscription can be reduced via OMP_DYNAMIC mode, but nesting two different runtimes is still inefficient.Dental
BTW, if you don't want to depend on GCC's extensions but are ok to use GPL3 code, you can just bundle their implementation. You can also convert their code to TBB if you really want to avoid OpenMP. It looks to me like they used divide and conquer generalized to arbitrary thread count. gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/…Indult
It's extremely hard to translate an OpenMP program that uses plain parallel regions and barriers (and thus relies on the number of threads) to TBB, since TBB is all about hiding threads and optional parallelism. TBB has no barriers; they are outside its model. However, GCC's implementation looks like an inherently recursive algorithm stretched onto a flat OpenMP region, so if one can understand it and translate it natively to TBB, it looks like a promising approach.Dental
The interleaved arrangement is likely to create a lot of false-sharing conflicts (i.e. send cache lines bouncing around) unless each item occupies a whole cache line.Peba
Here are some results from testing __gnu_parallel::partition on a std::array of 1,000,000,000 pointers: software.intel.com/sites/default/files/managed/2d/25/…Disown

I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, Θ(N) work, and Θ(P + lg N) span (critical path). The last two values are important for analysis, since speedup is limited to work/span.

I'm assuming the input is a random-access sequence. The steps are:

  1. Allocate a temporary array big enough to hold a copy of the input sequence.
  2. Divide the input into K blocks. K is a tuning parameter. For a system with P hardware threads, K=max(4*P,L) might be good, where L is a constant for avoiding ridiculously small blocks. The "4*P" allows some load balancing.
  3. Move each block to its corresponding position in the temporary array and partition it using std::partition. Blocks can be processed in parallel. Remember the offset of the "middle" for each block. You might want to consider writing a custom routine that both moves (in the C++11 sense) and partitions a block.
  4. Compute the offset to where each part of a block should go in the final result. The offsets for the first part of each block can be done using an exclusive prefix sum over the offsets of the middles from step 3. The offsets for the second part of each block can be computed similarly by using the offset of each middle relative to the end of its block. The running sums in the latter case become offsets from the end of the final output sequence. Unless you're dealing with more than 100 hardware threads, I recommend using a serial exclusive scan.
  5. Move the two parts of each block from the temporary array back to the appropriate places in the original sequence. Copying each block can be done in parallel.
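
A minimal sketch of the five steps above (all names are illustrative; K is passed in rather than derived from P, and tuning is omitted):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/partitioner.h>

// Degenerate sample-sort partition. Assumes K >= 1 blocks.
template <typename T, typename Pred>
T* sample_sort_partition(T* first, T* last, Pred pred, std::size_t K) {
    const std::size_t n = last - first;
    std::vector<T> tmp(n);                            // step 1
    std::vector<std::size_t> mid(K, 0), lo(K + 1, 0), hi(K + 1, 0);
    const std::size_t block = (n + K - 1) / K;        // step 2

    tbb::affinity_partitioner ap;                     // reused in step 5
    // Step 3: move each block into tmp and partition it in place there.
    tbb::parallel_for(std::size_t(0), K, [&](std::size_t k) {
        std::size_t b = std::min(k * block, n), e = std::min(b + block, n);
        std::move(first + b, first + e, tmp.data() + b);
        T* m = std::partition(tmp.data() + b, tmp.data() + e, pred);
        mid[k] = m - (tmp.data() + b);                // middle offset in block k
    }, ap);

    // Step 4: serial exclusive prefix sums over the sizes of both parts.
    for (std::size_t k = 0; k < K; ++k) {
        std::size_t b = std::min(k * block, n), e = std::min(b + block, n);
        lo[k + 1] = lo[k] + mid[k];                   // "true" parts
        hi[k + 1] = hi[k] + (e - b) - mid[k];         // "false" parts
    }
    const std::size_t split = lo[K];                  // final middle

    // Step 5: move both parts of each block back to their final places.
    tbb::parallel_for(std::size_t(0), K, [&](std::size_t k) {
        std::size_t b = std::min(k * block, n), e = std::min(b + block, n);
        std::move(tmp.data() + b, tmp.data() + b + mid[k], first + lo[k]);
        std::move(tmp.data() + b + mid[k], tmp.data() + e,
                  first + split + hi[k]);
    }, ap);
    return first + split;
}
```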

There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.

If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.

Note that partitioning requires only Θ(N) work for Θ(N) memory loads and stores. Memory bandwidth could easily become the limiting resource for speedup.

Peba answered 30/5, 2014 at 17:28 Comment(2)
Thank you, this is a very good suggestion. I was also worried about memory bandwidth which is why I was thinking about the interleaved partition idea. I was hoping it might save a lot of copying during the merge step. If I interleave in cache-sized blocks maybe I can avoid the false sharing problem too. I'm going to code it up on Monday, it should only require implementing special interleaved iterators and using them in parallel with std::partition...doesn't seem too hard.Disown
I hadn't seen the affinity_partitioner before, it looks useful, thanks.Disown

Why not parallelize something similar to std::partition_copy instead? The reasons are:

  • for std::partition, in-place swaps as in Adam's solution incur logarithmic complexity due to the recursive merging of results
  • you'll pay memory for parallelism anyway when using threads and tasks
  • if the objects are heavy, it is more reasonable to swap (shared) pointers anyway
  • if the results can be stored concurrently, then threads can work independently

It's pretty straightforward to apply a parallel_for (for random-access iterators) or tbb::parallel_for_each (for non-random-access iterators) to start processing the input range. Each task can store the 'true' and 'false' results independently. There are lots of ways to store the results, some off the top of my head:

  • using tbb::parallel_reduce (only for random-access iterators), store the results locally in the task body and move-append them in join() from the other task
  • use tbb::concurrent_vector's grow_by() method to copy local results in bulk, or push_back() each result separately on arrival
  • cache thread-local results in a tbb::combinable TLS container and combine them later
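
For instance, a minimal sketch of the tbb::combinable variant (names are illustrative; the relative order of results is not preserved):

```cpp
#include <cstddef>
#include <utility>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/combinable.h>
#include <tbb/parallel_for.h>

// Each thread appends matches to its own pair of local vectors,
// which are merged serially afterwards.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>>
partition_copy_tls(const std::vector<T>& in, Pred pred) {
    using Buckets = std::pair<std::vector<T>, std::vector<T>>;
    tbb::combinable<Buckets> tls;
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, in.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            Buckets& local = tls.local();             // thread-local buckets
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                (pred(in[i]) ? local.first : local.second).push_back(in[i]);
        });
    Buckets out;
    tls.combine_each([&](Buckets& b) {                // serial combine
        out.first.insert(out.first.end(), b.first.begin(), b.first.end());
        out.second.insert(out.second.end(), b.second.begin(), b.second.end());
    });
    return out;
}
```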

The exact semantics of std::partition_copy can be achieved by copying from the temporary storage described above, or:

  • (only for random-access output iterators) use atomic<size_t> cursors to synchronize where to store the results (assuming there is enough space)
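
And a minimal sketch of the atomic-cursor variant, assuming the output buffers are preallocated to in.size() (names are illustrative; unlike serial std::partition_copy, relative order is not preserved):

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Two atomic cursors hand out destination slots; returns the number of
// elements written to each output.
template <typename T, typename Pred>
std::pair<std::size_t, std::size_t>
parallel_partition_copy(const std::vector<T>& in, std::vector<T>& out_true,
                        std::vector<T>& out_false, Pred pred) {
    std::atomic<std::size_t> t(0), f(0);
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, in.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                if (pred(in[i]))
                    out_true[t.fetch_add(1, std::memory_order_relaxed)] = in[i];
                else
                    out_false[f.fetch_add(1, std::memory_order_relaxed)] = in[i];
            }
        });
    return std::make_pair(t.load(), f.load());
}
```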
Dental answered 29/5, 2014 at 14:30 Comment(2)
My "interleaved partition" idea was an attempt to get around the logarithmic complexity of the merge. Ignoring worst cases, two interleaved slices of an array should typically have similar value distributions. If so, the two resulting partition/middle pointers should end up near each other. The end sections of the original array (begin->middle_low, middle_high->end) will be correctly partitioned and won't need any merging. Only the elements in the middle range (middle_low->middle_high) would need swapping.Disown
@atb, if objects are aligned to the cache-line size (for array/vector) or allocated in separate cache lines (for list), it will not cause (much) false sharing. Otherwise, if the objects are tightly packed in an array, I'm afraid interleaved access will cause extensive false-sharing penalties, so the 'constant' complexity will not help because the constant is too big. Interleaving may be useful for read-only structures.Dental

Your approach should be correct, but why not follow the regular divide-and-conquer (or parallel_for) method? For two threads:

  1. split the array in two. Turn your [start, end) into [start, middle), [middle, end).
  2. run std::partition on both ranges in parallel.
  3. merge the partitioned results. This can be done with a parallel_for.

This should make better use of the cache.
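
A minimal sketch of this, assuming C++11 random-access iterators (the merge here is a serial std::rotate; it could itself be parallelized with chunked swap_ranges, as discussed in the comments below):

```cpp
#include <algorithm>
#include <tbb/parallel_invoke.h>

// After the two parallel partitions, the left half ends with a misplaced
// "false" block [m1, middle) and the right half starts with a misplaced
// "true" block [middle, m2). Rotating [m1, m2) around middle fixes both
// at once; std::rotate returns the merged partition point.
template <typename It, typename Pred>
It two_way_partition(It first, It last, Pred pred) {
    It middle = first + (last - first) / 2;   // step 1: split in two
    It m1 = first, m2 = last;
    tbb::parallel_invoke(                     // step 2: partition in parallel
        [&] { m1 = std::partition(first, middle, pred); },
        [&] { m2 = std::partition(middle, last, pred); });
    return std::rotate(m1, middle, m2);       // step 3: merge
}
```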

Indult answered 28/5, 2014 at 23:51 Comment(11)
I didn't like this approach because I thought it would require too much swapping during the merge; on average, half the array may need to get swapped. Merging interleaved arrays is much simpler because the order within the partitioned segments is arbitrary, so most of the elements can be left alone; only the middle of the array needs to be cleaned up. But you're right, it might not be cache friendly...Disown
Yes, but that swap is very fast. The interleaving guarantees non-consecutive accesses, which reduces your cache hits. I don't know how you plan to implement a striding iterator, but the only approaches I can think of will make the code uglier and make the compiler's job of optimizing your code harder. A work-efficient approach is not always the fastest. You asked for opinions, and that's mine.Indult
if the merge is needed, parallel_reduce fits better than parallel_forDental
@Dental huh? You have two ranges and you need to swap their contents. How is that a reduction?Indult
If you need two subranges, how do you get them if not via parallel_reduce? Its join operation guarantees that no other thread processes the same subrange.Dental
@Dental a reduce takes a range and reduces it to a single value. How is that useful for the merge here? When we say merge, we mean we want to combine two partitioned lists into a single partitioned list.Indult
what you need is a parallel std::swap_ranges (i.e. a parallel_for).Indult
Ah, if you want to apply parallel_for only for swap_ranges, that's OK (for random-access iterators). I was thinking of a more high-level approach, e.g. divide-and-conquer done using parallel_reduce (note, it's not only about reducing something; it's generally about binary splitting and joining of tasks in parallel), with swap_ranges done using a nested parallel_for. The join phase is exactly point (3) in your answer. Point (2) is recursive, so there will be a lot of overlapping subranges in the join phase (for subsequent joins).Dental
@Dental "divide-and-conquer is done using parallel_reduce" I think you have that backwards. If not, can you provide a link?Indult
@Adam, sorry, a link to what? I think the documentation is clear enough; just pay attention to Body::Body( Body&, split ) and Body::join( Body& rhs ) and the following description. And if you have a specific question, please ask it separately (here or on the TBB forum)Dental
None of which applies to the merge step. But thanks for the condescending tone. Goodbye.Indult

"It seems to me like this should parallelize nicely, any thoughts before I try it?"

Well... maybe a few:

  • There's no real reason to create more tasks than you have cores. Since your algorithm is recursive, you also need to take care not to create additional threads after you reach your limit, because they'll just be needless effort.
  • Keep in mind that splitting and merging the arrays costs processing power, so choose the split size so that it won't actually slow your calculations down. Splitting a 10-element array can be tempting, but won't get you where you want to be. Since the complexity of std::partition is linear, it's fairly easy to overestimate the speedup from splitting.

Since you asked and gave an algorithm, I hope you actually need parallelization here. If so, there's not much to add; the algorithm itself looks really fine :)

Swacked answered 28/5, 2014 at 23:59 Comment(3)
I thought TBB liked having more tasks than threads so it can do better load balancing? But I agree about the lower limit; I tried to hint at that in step 1.Disown
@Disown the library can't do miracles. If your CPU can handle X tasks at once (so, simplifying, it has X cores), it won't do more, no matter what library you use. I'm not a specialist when it comes to Intel's TBB; maybe it takes care of needless threads for you. That's something to be found in the manuals.Codd
@Disown is right, there is a reason for creating more tasks: it helps to load-balance the work. Even for what looks like well-balanced work, it still makes sense, since cache misses, interrupts, and preemptions create imbalance. Of course, creating too many tasks can induce excessive overheads, so TBB recommends keeping tasks to a minimum of around 10K clock cycles. Also, auto_partitioner does a good job of finding the golden mean for the number of tasks produced by parallel_for or parallel_reduceDental
