I am interested in offloading work to the GPU with OpenMP.
The code below gives the correct value of sum on the CPU
//g++ -O3 -Wall foo.cpp -fopenmp
#pragma omp parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;
It also works on the GPU with OpenACC like this
//g++ -O3 -Wall foo.cpp -fopenacc
#pragma acc parallel loop reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;
nvprof shows that it runs on the GPU, and it's also faster than OpenMP on the CPU.
However, when I try to offload to the GPU with OpenMP like this
//g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
#pragma omp target teams distribute parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;
it gets the wrong result for sum (it just returns zero). nvprof seems to show that it runs on the GPU, but it's much slower than OpenMP on the CPU.
Why is the reduction failing with OpenMP on the GPU?
Here is the full code I used to test this
#include <stdio.h>
//g++ -O3 -Wall acc2.cpp -fopenmp -fno-stack-protector
//sudo nvprof ./a.out
int main (void) {
    int sum = 0;
    //#pragma omp parallel for reduction(+:sum)
    //#pragma acc parallel loop reduction(+:sum)
    #pragma omp target teams distribute parallel for reduction(+:sum)
    for(int i = 0 ; i < 2000000000; i++) {
        sum += i%11;
    }
    printf("sum = %d\n",sum);
    return 0;
}
I am using GCC 7.2.0 on Ubuntu 17.10, along with gcc-offload-nvptx.
[...] map(tofrom:sum). They don't do that in these slides. I learned about it here. – Resurrect
[...] simd clause. Now OpenMP on the GPU is significantly faster than OpenACC or OpenMP on the CPU. See the end of my answer for more details. – Resurrect
[...] sudo for running nvprof? – Boxthorn
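For reference, here is a minimal sketch of what the map(tofrom:sum) clause mentioned in the comments might look like when applied to the loop from the question. As far as I understand OpenMP 4.5, a scalar that is not listed in a map clause is treated as firstprivate on a target construct, so the device updates its own copy and the host value stays at zero. The helper name offload_mod_sum and its parameter n are hypothetical (they are not in the original post), and the accumulator is widened to long long as an assumption of this sketch, not something the post does.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): sums i % 11 for i in [0, n).
// With -fopenmp and a working offload compiler the loop may run on the device;
// otherwise the pragma is ignored or falls back to the host, with the same result.
long long offload_mod_sum(int n) {
    assert(n >= 0);
    // Assumption of this sketch: long long instead of the post's int,
    // so that the total cannot overflow for large n.
    long long sum = 0;
    // map(tofrom:sum) copies sum to the device before the region
    // and copies the reduced value back to the host afterwards.
    #pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += i % 11;
    }
    return sum;
}
```

Calling offload_mod_sum(1000) should return 4995 whether or not the loop is actually offloaded, which makes it easy to compare the CPU and GPU builds. The second comment also mentions a simd clause; that would presumably be added to the same combined directive, but I have not measured its effect here.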