I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, has Θ(N) work, and Θ(P+ lg N) span (critical path). The last two values are important for analysis, since speedup is limited to work/span.
I'm assuming the input is a random-access sequence. The steps are:
- Allocate a temporary array big enough to hold a copy of the input sequence.
- Divide the input into K blocks. K is a tuning parameter. For a system with P hardware threads, K=max(4*P,L) might be good, where L is a constant for avoiding ridiculously small blocks. The "4*P" allows some load balancing.
- Move each block to its corresponding position in the temporary array and partition it using std::partition. Blocks can be processed in parallel. Remember the offset of the "middle" for each block. You might want to consider writing a custom routine that both moves (in the C++11 sense) and partitions a block.
- Compute the offset to where each part of a block should go in the final result. The offsets for the first part of each block can be done using an exclusive prefix sum over the offsets of the middles from step 3. The offsets for the second part of each block can be computed similarly by using the offset of each middle relative to the end of its block. The running sums in the latter case become offsets from the end of the final output sequence. Unless you're dealing with more than 100 hardware threads, I recommend using a serial exclusive scan.
- Move the two parts of each block from the temporary array back to the appropriate places in the original sequence. Copying each block can be done in parallel.
There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.
If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.
Note that partitioning requires only Θ(N) work for Θ(N) memory loads and stores. Memory bandwidth could easily become the limiting resource for speedup.
-fopenmp
and defining_GLIBCXX_PARALLEL
to use the parallel versions of the standard library that have already been written, tested, etc. If you really want to only runstd::partition
in parallel, you can includeparallel/algorithm
, and call__gnu_parallel::partition
. – Caston