Can OpenMP be used for GPUs?

Asked 10/3, 2015 at 11:33 Answered 29/7, 2017 at 18:44

multithreading fortran gpu openmp openacc

I've been searching the web but I'm still very confused about this topic. Can anyone explain this more clearly? I come from an Aerospace Engineering background (not from a Computer Science one), so when I read online about OpenMP/CUDA/etc. and multithreading I don't really understand a great deal of what is being said.

I'm currently trying to parallelize an in-house CFD code written in FORTRAN. These are my questions:

OpenMP shares the workload using multiple threads from the CPU. Can it be used to allow the GPU to get some of the work too?
I've read about OpenACC. Is it similar to OpenMP (easy to use)?

I've also read about CUDA and kernels, but I don't have much experience in parallel programming and I don't have the faintest idea of what a kernel is.

Is there an easy and portable way to share my workload with the GPU, for FORTRAN (if OpenMP doesn't do that and OpenACC is not portable)?

Can you give me a "for dummies" type of answer?

Yulandayule answered 10/3, 2015 at 11:33 Comment(3)

I'd suggest having a look at OpenCL, as it is an easy way to share the same code for execution on CPU and GPU. A kernel is the basic unit of executable code, like a C function, which can be data-parallel or task-parallel. Bindings for Fortran to OpenCL also exist. Have a look at the introduction series to OpenCL from AMD: youtube.com/watch?v=ecYIsu83c0I&list=PL3B46A983A7382FA6 – Ceasefire 10/3, 2015 at 11:41

Search on the term "OpenMP accelerators". Support for accelerators (of which GPUs are a type) was introduced with OpenMP 4.0. – Flong 10/3, 2015 at 16:18

The upcoming GCC 5 compiler release has the offloading infrastructure in place, as OpenMP 4.0 and OpenACC compute offloading to accelerators begin to mature in this open-source compiler. For those willing to toy with the latest experimental code, it's possible to get your feet wet if you have an NVIDIA GPU or a supported Intel Xeon Phi MIC card. – Junction 13/3, 2015 at 8:3

Yes. The OpenMP 4 target constructs were designed to support a wide range of accelerators. Compiler support for NVIDIA GPUs is available from GCC 7+ (see 1 and 2, although the latter has not been updated to reflect OpenMP 4 GPU support), Clang (see 3,4,5), and Cray. Compiler support for Intel GPUs is available in the Intel C/C++ compiler (see e.g. 6).

The IBM-developed Clang/LLVM implementation of OpenMP 4+ for NVIDIA GPUs is available from https://github.com/clang-ykt. The build recipe is provided in "OpenMP compiler for CORAL/OpenPower Heterogeneous Systems".

The Cray compiler supports OpenMP target for NVIDIA GPUs. From Cray Fortran Reference Manual (8.5):

The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.

The Intel compiler supports OpenMP target for Intel Gen graphics for C/C++ but not Fortran. Furthermore, the teams and distribute constructs are not supported on that target because they are not necessary/appropriate there. Below is a simple example showing how the OpenMP target features work in different environments.

void vadd2(int n, float * a, float * b, float * c)
{
    #pragma omp target map(to:n,a[0:n],b[0:n]) map(from:c[0:n])
#if defined(__INTEL_COMPILER) && defined(__INTEL_OFFLOAD)
    #pragma omp parallel for simd
#else
    #pragma omp teams distribute parallel for simd
#endif
    for(int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

The compiler options for Intel and GCC are as follows. I don't have GCC set up for NVIDIA GPUs, but you can see the documentation for the appropriate -foffload options.

$ icc -std=c99 -qopenmp -qopenmp-offload=gfx -c vadd2.c && echo "SUCCESS" || echo "FAIL"
SUCCESS
$ gcc-7 -fopenmp -c vadd2.c && echo "SUCCESS" || echo "FAIL"
SUCCESS
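A successful compile does not by itself show that the loop runs on the GPU, since a target region falls back to the host when no device is available. The following is a minimal sketch of one way to check, assuming an OpenMP 4.0+ compiler; the function names target_runs_on_device, vadd_checked and report_offload are made up for illustration, and only omp_is_initial_device() comes from the OpenMP runtime API. Without an OpenMP flag the pragmas are ignored and the code simply runs serially on the CPU.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only present when an OpenMP flag is passed to the compiler */
#endif

/* Returns 1 if a target region really executes on an accelerator,
   0 if it falls back to the host or OpenMP is disabled.
   (Hypothetical helper name.) */
int target_runs_on_device(void)
{
    int on_host = 1;
#ifdef _OPENMP
    #pragma omp target map(tofrom: on_host)
    {
        /* Evaluated wherever the target region ends up running. */
        on_host = omp_is_initial_device();
    }
#endif
    return !on_host;
}

/* Same vector addition as in the answer, using the combined construct
   and a simple precondition check. (Hypothetical helper name.) */
void vadd_checked(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Prints where target regions run in this build and on this machine. */
void report_offload(void)
{
    printf("target regions run on: %s\n",
           target_runs_on_device() ? "device" : "host");
}
```

Calling report_offload() once at start-up is usually enough to tell whether the -foffload (or equivalent) setup is actually taking effect.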

Caddish answered 27/7, 2017 at 21:32 Comment(7)

The question asks specifically for Fortran. – Chirrupy 28/7, 2017 at 5:59

IBM is developing two OpenMP compilers. One is the Clang/LLVM one. The other is the XL compiler. For Fortran, the XL Fortran compiler supports a large subset of OpenMP 4.5 offloading to NVIDIA GPUs, starting in version 15.1.5. More features are being added this year and next year, with the aim of complete support in 2018. If you're on POWER, you can join the beta program to get access to the latest features. – Sawfly 28/7, 2017 at 16:54

@VladimirF The question says CUDA was considered. CUDA is a derivative of C/C++ so the OpenMP 4 support in Intel C/C++ is no less applicable. Furthermore, the C interoperability features in Fortran 2003 mean that Fortran application development is not mutually exclusive of accelerators models based on or limited to C/C++ (e.g. OpenCL). – Caddish 28/7, 2017 at 22:36

@RafikZurob You should post that as answer to the question, since it might be missed as a comment here. I work for Intel, so I am not in a position to evaluate or comment on the POWER ecosystem. I've used the Clang/LLVM implementation on x86, which is why I commented on it. – Caddish 28/7, 2017 at 22:48

@Jeff It might have been CUDA Fortran. Anyway, in my reading the question clearly asks about the applicability of OpenMP for accelerators in Fortran. He speaks about a Fortran code (should he rewrite it completely in C++?) and the question has the fortran tag only. – Chirrupy 29/7, 2017 at 6:40

Mixed-language programming is increasingly common, particularly in the context of accelerators. I work on a code (NWChem) that is more than 4 million lines of Fortran, but someone wrapped CUDA C because that was the only way to use NVIDIA hardware in ~2010. I am genuinely confused by your hostility towards my answer because it references a C/C++ only feature of the Intel compiler as evidence that OpenMP 4.5 compilers can target GPU hardware besides NVIDIA's, not because I think that is what the OP wants to use. – Caddish 30/7, 2017 at 21:18

This is a very helpful answer. Thanks! On the CPU the simd clause is often not very helpful but on the GPU it appears to make a big difference (with GCC). See the end of this answer. – Junction 12/3, 2018 at 9:53

The OpenMP 4.0 standard includes support for accelerators (GPU, DSP, Xeon Phi, and so on), but I don't know of any existing implementation of the OpenMP 4.0 standard for GPUs, only early experimental ones.
OpenACC is indeed similar to OpenMP and easy to use. Good OpenACC tutorial: part 1 and part 2.
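To make the similarity concrete, here is a rough sketch of the same loop annotated once with OpenACC and once with OpenMP target directives. It is written in C for brevity (Fortran uses the same directive names behind !$acc and !$omp sentinels); the function names vadd_acc and vadd_omp are invented for this illustration. A compiler that is not given an OpenACC or OpenMP flag ignores the pragmas and runs both loops serially.

```c
#include <assert.h>

/* OpenACC: copyin/copyout describe the host-to-device and
   device-to-host transfers around the loop. */
void vadd_acc(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* OpenMP 4.x: map(to:)/map(from:) play the same role as
   copyin/copyout in the OpenACC version. */
void vadd_omp(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

In both models the loop body is untouched; only the directive above it changes, which is why moving between the two is usually much less work than a rewrite in CUDA or OpenCL.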

Unfortunately, I think there is no portable solution for CPU and GPU, at least for now (except for OpenCL, but it is too low-level compared to OpenMP and OpenACC).

If you need a portable solution, you could consider using an Intel Xeon Phi accelerator instead of a GPU. The Intel Fortran (and C/C++) compiler includes OpenMP support for both the CPU and Xeon Phi.

In addition, to create a really portable solution, it is not enough to pick a suitable parallel technology. You have to restructure your program so that it exposes enough parallelism. See "Structured Parallel Programming" or similar books for examples of possible approaches.
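As a small illustration of what exposing parallelism can mean, consider a dot product. Written naively, every iteration updates the same variable, so the iterations look dependent on each other. Declaring that variable as a reduction tells the compiler the partial sums can be computed independently and combined at the end. This is only a sketch in C; the function name dot is chosen for this example, and the reduction clause is standard OpenMP.

```c
#include <assert.h>

/* Dot product of x and y. The reduction clause gives each thread a
   private partial sum and adds the partial sums together afterwards. */
double dot(int n, const double *x, const double *y)
{
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Loops whose iterations genuinely depend on earlier iterations (for example a time-stepping recurrence) cannot be handled with a single clause like this and typically need an algorithmic change.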

Regrate answered 11/3, 2015 at 7:13 Comment(2)

Which is better: running the computationally heavy parts of the program on a CPU or on a GPU? Of course it depends on the specific hardware, but in general? – Shoreline 11/3, 2015 at 18:39

In general, it is better to run computationally intensive parts on the GPU (or another accelerator such as Xeon Phi or FPGA). The performance of modern accelerators is at least 5 times higher than CPU performance. – Regrate 12/3, 2015 at 2:11

To add to what was said about support on other platforms above: IBM is contributing to two OpenMP 4.5 compilers: One is the open source Clang/LLVM one. The other is IBM's XL compiler. Both compilers share the same helper OpenMP offloading library, but differ in the compiler's code generation and optimization for the GPU. For Fortran, the XL Fortran compiler supports a large subset of OpenMP 4.5 offloading to NVIDIA GPUs, starting in version 15.1.5. (And version 13.1.5 for XL C/C++). More features are being added this year and next year, with the aim of complete support in 2018. If you're on POWER, you can join the XL compiler beta program to get access to our latest OpenMP offloading features in Fortran and C/C++.

Sawfly answered 29/7, 2017 at 18:44 Comment(0)

The previous answer covers most of it, but since you spoke about giving the GPU some work as well, you might want to take a look at frameworks for heterogeneous computing (CPU + GPU simultaneously), such as StarPU.

As StarPU is only for C/C++, you have ForOpenCL for Fortran.

You'll have to consider the trade-off between performance and convenience in any case.

Izard answered 12/3, 2015 at 0:24 Comment(3)

Did you notice the Fortran tag? – Chirrupy 12/3, 2015 at 7:11

StarPU seems cool, but if I've seen correctly, it is only for C. – Shoreline 12/3, 2015 at 8:7

This response does not even attempt to answer the question. – Caddish 27/7, 2017 at 21:33
