OpenMP offloading to Nvidia gives wrong reduction

I am interested in offloading work to the GPU with OpenMP.

The code below gives the correct value of sum on the CPU

//g++ -O3 -Wall foo.cpp -fopenmp
#pragma omp parallel for reduction(+:sum)                                                                                                                                    
for(int i = 0 ; i < 2000000000; i++) sum += i%11;

It also works on the GPU with OpenACC like this

//g++ -O3 -Wall foo.cpp -fopenacc   
#pragma acc parallel loop reduction(+:sum)                                                                                                                                    
for(int i = 0 ; i < 2000000000; i++) sum += i%11;

nvprof shows that it runs on the GPU and it's also faster than OpenMP on the CPU.

However when I try to offload to the GPU with OpenMP like this

//g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
#pragma omp target teams distribute parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;

it gets the wrong result for sum (it just returns zero). nvprof seems to show that it runs on the GPU but it's much slower than OpenMP on the CPU.

Why is the reduction failing with OpenMP on the GPU?

Here is the full code I used to test this:

#include <stdio.h>
//g++ -O3 -Wall acc2.cpp -fopenmp -fno-stack-protector                                                                                                                           
//sudo nvprof ./a.out                                                                                                                                                            
int main (void) {
  int sum = 0;
  //#pragma omp parallel for reduction(+:sum)                                                                                                                                    
  //#pragma acc parallel loop reduction(+:sum)                                                                                                                                   
  #pragma omp target teams distribute parallel for reduction(+:sum)
  for(int i = 0 ; i < 2000000000; i++) {
    sum += i%11;
  }
  printf("sum = %d\n",sum);
  return 0;
}

I am using GCC 7.2.0 on Ubuntu 17.10, with gcc-offload-nvptx installed.

Resurrect answered 5/3, 2018 at 13:22 Comment(7)
Probably sum is mapped to device as firstprivate (this is the default data-sharing attribute for scalar variables in target regions).Aret
@IlyaVerbin, thanks, I got it to work by adding the clause map(tofrom:sum). They don't do that in these slides. I learned about it here.Resurrect
Yes, the default mapping changed between OpenMP 4.0 and OpenMP 4.5.Aret
@IlyaVerbin, do you have any idea why OpenMP is much slower than OpenACC for this operation on the same hardware? OpenMP is about two times slower.Resurrect
No, I don't have any experience with GPU programming. I would advise to compare the PTX code, which is generated for both cases. "gcc -foffload=-save-temps ..." should dump it.Aret
@IlyaVerbin, the solution for the speed was to add the simd clause. Now OpenMP on the GPU is significantly faster than OpenACC or OpenMP on the CPU. See the end of my answer for more details.Resurrect
why do you use sudo for running nvprof ?Boxthorn

The solution was to add the clause map(tofrom:sum) like this:

//g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
#pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;

This gets the correct result for sum; however, the code is still much slower than OpenACC, or than OpenMP without target.

Update: the solution to the speed was to add the simd clause. See the end of this answer for more information.


The solution above has a lot of clauses on one line. It can be broken up like this:

#pragma omp target data map(tofrom: sum)
#pragma omp target teams distribute parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;

Another option is to use defaultmap(tofrom:scalar):

#pragma omp target teams distribute parallel for reduction(+:sum) defaultmap(tofrom:scalar)

Apparently, scalar variables in OpenMP 4.5 target regions are firstprivate by default: https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/

defaultmap(tofrom:scalar) is convenient if you have multiple scalar values you want shared.


I also implemented the reduction manually to see if I could speed it up. I have not managed to speed it up, but here is the code anyway (there are other optimizations I tried, but none of them helped).

#include <omp.h>
#include <stdio.h>

//g++ -O3 -Wall acc2.cpp -fopenmp -fno-stack-protector
//sudo nvprof ./a.out

// For a <= b this computes floor(a*b/c) without overflowing the
// intermediate product a*b; it is used below to split the iteration
// space into contiguous chunks per team and per thread.
static inline int foo(int a, int b, int c) {
  return a > b ? (a/c)*b + (a%c)*b/c : (b/c)*a + (b%c)*a/c;
}

int main (void) {
  int nteams = 0, nthreads = 0;

  #pragma omp target teams map(tofrom: nteams) map(tofrom:nthreads)
  {
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();
  }
  int N = 2000000000;
  int sum = 0;

  #pragma omp declare target(foo)  

  #pragma omp target teams map(tofrom: sum)
  {
    int nteams = omp_get_num_teams();
    int iteam = omp_get_team_num();
    int start  = foo(iteam+0, N, nteams);
    int finish = foo(iteam+1, N, nteams);    
    int n2 = finish - start;
    #pragma omp parallel
    {
      int sum_team = 0;
      int ithread = omp_get_thread_num();
      int nthreads = omp_get_num_threads();
      int start2  = foo(ithread+0, n2, nthreads) + start;
      int finish2 = foo(ithread+1, n2, nthreads) + start;
      for(int i=start2; i<finish2; i++) sum_team += i%11;
      #pragma omp atomic
      sum += sum_team;
    }   
  }   

  printf("devices %d\n", omp_get_num_devices());
  printf("default device %d\n", omp_get_default_device());
  printf("initial device %d\n", omp_get_initial_device());
  printf("nteams %d\n", nteams);
  printf("nthreads per team %d\n", nthreads);
  printf("total threads %d\n", nteams*nthreads);
  printf("sum %d\n", sum);
  return 0;
}

nvprof shows that most of the time is spent in cuCtxSynchronize. With OpenACC it's about half of that.


I finally managed to dramatically speed up the reduction. The solution was to add the simd clause:

#pragma omp target teams distribute parallel for simd reduction(+:sum) map(tofrom:sum)

That's nine clauses on one line. A slightly shorter solution is

#pragma omp target map(tofrom:sum)
#pragma omp teams distribute parallel for simd reduction(+:sum)

The times are

OMP_GPU    0.25 s
ACC        0.47 s
OMP_CPU    0.64 s

OpenMP on the GPU is now much faster than OpenACC or OpenMP on the CPU. I don't know if OpenACC can be sped up with some additional clauses.

Hopefully, Ubuntu 18.04 ships a gcc-offload-nvptx that no longer needs -fno-stack-protector.

Resurrect answered 5/3, 2018 at 13:22 Comment(1)
I got OpenACC to be about the same speed as OpenMP using #pragma acc parallel num_gangs(128) followed on the next line by #pragma acc loop gang, vector reduction(+:sum)Resurrect
