Are C++17 Parallel Algorithms implemented already?

6

45

I was trying out the new parallel algorithms introduced in the C++17 standard, but I couldn't get them to work. I tried compiling with up-to-date versions of g++ (8.1.1) and clang++ (6.0) using -std=c++17, but neither seemed to support #include <execution>, std::execution::par or anything similar.

When looking at the cppreference for parallel algorithms there is a long list of algorithms, claiming

Technical specification provides parallelized versions of the following 69 algorithms from algorithm, numeric and memory: ( ... long list ...)

which sounds like the algorithms are ready 'on paper', but not ready to use yet?

In this SO question from over a year ago the answers claim these features hadn't been implemented yet. But by now I would have expected to see some kind of implementation. Is there anything we can use already?

Pruett answered 25/6, 2018 at 20:3 Comment(2)
It seems that MSVC is the only major compiler that supports these features, see here. - Formally
I'm looking for those features for g++ as well, but it doesn't seem to be planned yet... - Pasargadae
20

You can check the implementation status of every C++17 feature here. For your case, search for "Standardization of Parallelism TS": at the moment only the MSVC and Intel C++ compilers support it.

Leila answered 16/11, 2018 at 5:48 Comment(1)
I put together an open source repo that works with the MSVC, Intel and g++ 9.1 compilers and demonstrates the usage and performance of parallel sort, along with a parallel merge sort (github.com/DragonSpit/ParallelAlgorithms). - Taps
50

GCC 9 has them but you have to install TBB separately

In Ubuntu 19.10, all components have finally aligned:

  • GCC 9 is the default compiler, and it is the first GCC version that ships the parallel algorithms
  • TBB (Intel Threading Building Blocks) is at 2019~U8-1, so it meets the minimum 2018 requirement

so you can simply do:

sudo apt install g++ libtbb-dev
g++ -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -o main.out main.cpp -ltbb
./main.out

and use as:

#include <execution>
#include <algorithm>

std::sort(std::execution::par_unseq, input.begin(), input.end());

see also the full runnable benchmark below.

GCC 9 and TBB 2018 are the first versions that work, as mentioned in the release notes: https://gcc.gnu.org/gcc-9/changes.html

Parallel algorithms and <execution> (requires Thread Building Blocks 2018 or newer).

Ubuntu 18.04 installation

Ubuntu 18.04 is a bit more involved, because GCC 9 has to come from a PPA and TBB has to be compiled from source.

Here are fully automated, tested commands for Ubuntu 18.04:

# Install GCC 9
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-9 g++-9

# Compile libtbb from source.
sudo apt-get build-dep libtbb-dev
git clone https://github.com/intel/tbb
cd tbb
git checkout 2019_U9
make -j `nproc`
TBB="$(pwd)"
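# The directory name encodes the compiler, libc and kernel versions of the build machine; adjust it to what the build produced.
TBB_RELEASE="${TBB}/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.15.0_release"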

# Use them to compile our test program.
g++-9 -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -I "${TBB}/include" -L "${TBB_RELEASE}" -Wl,-rpath,"${TBB_RELEASE}" -o main.out main.cpp -ltbb
./main.out

Test program analysis

I have tested with this program that compares the parallel and serial sorting speed.

main.cpp

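#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>   // fixed-width integer types
#include <cstdlib>   // C string to integer conversion
#include <execution>
#include <iostream>
#include <random>
#include <string>    // std::stoi
#include <vector>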

int main(int argc, char **argv) {
    using clk = std::chrono::high_resolution_clock;
    decltype(clk::now()) start, end;
    std::vector<unsigned long long> input_parallel, input_serial;
    unsigned int seed;
    unsigned long long n;

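    // Distribution over the full uint64_t range [0, max].
    std::uniform_int_distribution<uint64_t> zero_ull_max(0);

    // CLI arguments: element count, then an optional PRNG seed.
    if (argc > 1) {
        n = std::strtoull(argv[1], nullptr, 0);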
    } else {
        n = 10;
    }
    if (argc > 2) {
        seed = std::stoi(argv[2]);
    } else {
        seed = std::random_device()();
    }

    std::mt19937 prng(seed);
    for (unsigned long long i = 0; i < n; ++i) {
        input_parallel.push_back(zero_ull_max(prng));
    }
    input_serial = input_parallel;

    // Sort and time parallel.
    start = clk::now();
    std::sort(std::execution::par_unseq, input_parallel.begin(), input_parallel.end());
    end = clk::now();
    std::cout << "parallel " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;

    // Sort and time serial.
    start = clk::now();
    std::sort(std::execution::seq, input_serial.begin(), input_serial.end());
    end = clk::now();
    std::cout << "serial " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;

    assert(input_parallel == input_serial);
}

On Ubuntu 19.10, on a Lenovo ThinkPad P51 laptop (CPU: Intel Core i7-7820HQ, 4 cores / 8 threads, 2.90 GHz base, 8 MB cache; RAM: 2x Samsung M471A2K43BB1-CRC, 2x 16 GiB, 2400 Mbps), a typical run for an input of 100 million numbers to be sorted:

./main.out 100000000

was:

parallel 2.00886 s
serial 9.37583 s

so the parallel version was about 4.7 times faster! See also: What do the terms "CPU bound" and "I/O bound" mean?

We can confirm that the process is spawning threads with strace:

strace -f -s999 -v ./main.out 100000000 |& grep -E 'clone'

which shows several lines of type:

[pid 25774] clone(strace: Process 25788 attached
[pid 25774] <... clone resumed> child_stack=0x7fd8c57f4fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd8c57f59d0, tls=0x7fd8c57f5700, child_tidptr=0x7fd8c57f59d0) = 25788

Also, if I comment out the serial version and run with:

time ./main.out 100000000

I get:

real    0m5.135s
user    0m17.824s
sys     0m0.902s

which confirms again that the algorithm was parallelized, since real < user, and gives an idea of how effectively it can be parallelized on my system (user/real is about 3.5x on 4 cores / 8 hardware threads).

Error messages

Hey, Google, index this please.

If you don't have tbb installed, the error is:

In file included from /usr/include/c++/9/pstl/parallel_backend.h:14,
                 from /usr/include/c++/9/pstl/algorithm_impl.h:25,
                 from /usr/include/c++/9/pstl/glue_execution_defs.h:52,
                 from /usr/include/c++/9/execution:32,
                 from parallel_sort.cpp:4:
/usr/include/c++/9/pstl/parallel_backend_tbb.h:19:10: fatal error: tbb/blocked_range.h: No such file or directory
   19 | #include <tbb/blocked_range.h>
      |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.

so we see that <execution> depends on an uninstalled TBB component.

If TBB is too old, e.g. the default Ubuntu 18.04 one, it fails with:

#error Intel(R) Threading Building Blocks 2018 is required; older versions are not supported.
Dispirited answered 5/5, 2019 at 7:48 Comment(6)
It's specified here: gcc.gnu.org/onlinedocs/gcc-9.1.0/libstdc++/manual/manual/…: Note 3: The Parallel Algorithms have an external dependency on Intel TBB 2018 or later. If the <execution> header is included then -ltbb must be used to link to TBB. It seems that the parallel algorithms are implemented by Intel's Parallel STL: github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/…, which itself requires TBB. - Alrick
@DanielLangr thanks! The main question I had (too lazy to research it now ;-)) is: is TBB in-tree? If yes, do I need to configure the GCC build with any extra flags? Otherwise, how do I install it on Ubuntu? I didn't know about -ltbb, so that's a good start already :-) Then I'm going to benchmark something and give a graph here, it will be fun! - Dispirited
Parallel STL headers are now a part of libstdc++. They define a parallel backend, and the only one supported is TBB. The TBB backend header requires TBB headers. However, I cannot find TBB headers in libstdc++. Therefore, not only is linking the TBB library required, but the TBB headers must be provided as well. An online demo supports this: wandbox.org/permlink/VSIcdvWCtTRko43Q. - Alrick
BTW, I would prefer a standalone implementation of the parallel STL algorithms to those from TBB. I use parallel sorting extensively and did a lot of experiments, which generally revealed that the implementations of parallel quicksort and mergesort in libstdc++ Parallel Mode (PM) are superior to those from TBB. Unfortunately, PM is built upon OpenMP, and reimplementing it with C++11 threads is anything but trivial. - Alrick
@DanielLangr thanks for this info! You must be doing some fun stuff over there hehe ;-) I'll try to play around with this later on. If you have any open source benchmark code, do share a link. - Dispirited
@DanielsaysreinstateMonica BTW, I have now found that in Ubuntu 19.10 everything just works and updated the answer :-) - Dispirited
14

Intel has released a Parallel STL library that follows the C++17 standard.

It is being merged into GCC.

Faintheart answered 25/6, 2018 at 20:59 Comment(0)
5

GCC does not yet implement the Parallelism TS (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2017).

However, libstdc++ (with GCC) has an experimental parallel mode that provides some equivalent parallel algorithms. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html

Getting it to work:

Any use of parallel functionality requires additional compiler and runtime support, in particular support for OpenMP. Adding this support is not difficult: just compile your application with the compiler flag -fopenmp. This will link in libgomp, the GNU Offloading and Multi Processing Runtime Library, whose presence is mandatory.

Code example

#include <vector>
#include <parallel/algorithm>

int main()
{
  std::vector<int> v(100);

  // ...

  // Explicitly force a call to parallel sort.
  __gnu_parallel::sort(v.begin(), v.end());
  return 0;
}
Leontineleontyne answered 25/6, 2018 at 20:38 Comment(0)
3

2023 UPDATE

Compiler and alternative library support for C++17 parallel algorithms:

                    Linux          macOS          Windows
GCC 8-              No             No             No
GCC 9+              TBB Required   TBB Required   TBB Required
Clang (libstdc++)   TBB Required   TBB Required   TBB Required
Clang (libc++)      No             No             No
Apple Clang         -              No             -
MSVC 15.7+ (2017)   -              -              Yes
Parallel STL        TBB Required   TBB Required   TBB Required
poolSTL             Yes*           Yes*           Yes*

* poolSTL does not implement all algorithms. However, it is available as a single header file, so it's an easy fallback for the other options.

MinGW is a strange one. Code using std::execution::par will compile and run, but performance is the same as sequential. I haven't found a reference for what the compiler actually supports (and why it behaves differently from GCC). If anyone has insight, please leave a comment.

Strobilaceous answered 28/11, 2023 at 11:0 Comment(0)
-1

GCC now supports the <execution> header, but the standard Clang builds from https://apt.llvm.org do not.

Zelazny answered 11/7, 2021 at 19:50 Comment(0)
