Difference between kernels construct and parallel construct

I have studied many articles and the OpenACC manual, but I still don't understand the main difference between these two constructs.

Lashondalashonde answered 19/11, 2012 at 19:39 Comment(0)

The kernels directive is the more general case, and probably the one you might think of if you've written GPU (e.g. CUDA) kernels before. kernels simply directs the compiler to work on a piece of code and produce an arbitrary number of "kernels", of arbitrary "dimensions", to be executed in sequence, in order to parallelize/offload a particular section of code to the accelerator. The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator, for example by specifying the dimensions of the parallelization. For example, the number of workers and gangs would normally be constant as part of the parallel directive (since only one underlying "kernel" is usually implied), but perhaps not with the kernels directive (since it may translate to multiple underlying "kernels").
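
As a rough sketch of the difference (the array names and the clause values of 256 gangs and vector length 128 are purely illustrative, not recommendations), the same loop offloaded both ways might look like this. With parallel the programmer fixes the launch shape; with kernels the compiler decides how many kernels to generate and how to dimension them:

// parallel: one kernel, with the dimensions stated by the programmer
#pragma acc parallel num_gangs(256) vector_length(128)
{
    #pragma acc loop gang vector
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}

// kernels: the compiler analyzes the region and chooses the kernels
// and their dimensions itself
#pragma acc kernels
{
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}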

A good treatment of this specific question is contained in this PGI article.

Quoting from the article summary: "The OpenACC kernels and parallel constructs each try to solve the same problem, identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate."

Pullet answered 20/11, 2012 at 2:13 Comment(1)
With GCC, parallel is implemented much better. As far as I can tell, reduction is not supported with kernels in GCC.Neoarsphenamine

OpenACC directives and GPU kernels are just two ways of representing the same thing -- a section of code that can run in parallel.

OpenACC may be best when retrofitting an existing app to take advantage of a GPU and/or when it is desirable to let the compiler handle more details related to issues such as memory management. This can make it faster to write an app, with a potential cost in performance.

Kernels may be best when writing a GPU app from scratch and/or when more fine-grained control is desired. This can make the app take longer to write, but may increase performance.

I think that people new to GPUs may be tempted to go with OpenACC because it looks more familiar. But I think it's actually better to go the other way: start by writing kernels, and then potentially move to OpenACC to save time in some projects. The reason is that OpenACC is a leaky abstraction. While OpenACC may make it look as if the GPU details are abstracted away, they are still there. Using OpenACC to write GPU code without understanding what is happening in the background is likely to be frustrating, with odd error messages when attempting to compile, and to result in an app that has low performance.
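
As a sketch of the kind of detail that leaks through the directives (the function and array names here are invented for illustration): a function called inside a compute region must still be compiled for the device, and data movement between host and device must still be reasoned about even though no explicit transfers appear in the code.

// Without this directive, calling scale() inside the parallel region
// typically fails to compile for the device.
#pragma acc routine seq
float scale(float v) { return 2.0f * v; }

void run(int n, float *restrict a, float *restrict b)
{
    // The data clauses make the host/device transfers explicit; without
    // thinking about them, the hidden copies can dominate the run time.
    #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
    for (int i = 0; i < n; i++)
        b[i] = scale(a[i]);
}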

Meggie answered 19/11, 2012 at 23:42 Comment(3)
This answer seems to be answering the question "What are the reasons to use or not use OpenACC" while ignoring the OP's question which has to do with differentiating between 2 slightly different ways of asking the OpenACC compiler to generate GPU code for a region. Also, quoting from the article linked "All non-trivial abstractions, to some degree, are leaky". So, a criticism with limited depth IMHO. I suggest it's better to assume this poster knows how to program GPUs and is, in fact, interested in the syntactical and functional differences between the 2 language constructs indicated.Pullet
I may indeed have answered the wrong question. I did not know that OpenACC also had a kernel concept. I thought it was all about directives, like OpenMP.Meggie
@RogerDahl - kernels is a directive defined by the OpenACC standard, which also includes the parallel directive.Reba

Parallel Construct

  1. Defines the region of the program that should be compiled for parallel execution on the accelerator device.

  2. The parallel loop directive is an assertion by the programmer that it is both safe and desirable to parallelize the affected loop. It relies on the programmer to have correctly identified the parallelism in the code and to have removed anything that may be unsafe to parallelize. If the programmer asserts incorrectly that the loop can be parallelized, the resulting application may produce incorrect results.

  3. The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator. So it does not rely heavily on the compiler’s ability to automatically parallelize the code.

  4. When parallel loop is used on two subsequent loops that access the same data, the compiler may or may not copy the data back and forth between the host and the device between the two loops (a data-region sketch addressing this follows the example below).

  5. More experienced parallel programmers, who may have already identified parallel loops within their code, will likely find the parallel loop approach more desirable.

For example:

#pragma acc parallel
{
    #pragma acc loop
    for (i=0; i<n; i++) 
         a[i] = 3.0f*(float)(i+1);
    #pragma acc loop
    for (i=0; i<n; i++) 
         b[i] = 2.0f*a[i];
}

 Generates one kernel.

 There is no barrier between the two loops: the second loop may start before the first loop ends. (This differs from OpenMP, where a worksharing loop has an implicit barrier at its end unless nowait is specified.)
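
As referenced in point 4 above, a hedged sketch of how an enclosing data region could keep a and b resident on the device across both loops, so nothing is copied back to the host in between (the clause choices assume a is only needed on the device and b's final values are needed back on the host):

#pragma acc data create(a[0:n]) copyout(b[0:n])
{
    #pragma acc parallel loop
    for (i = 0; i < n; i++)
        a[i] = 3.0f*(float)(i+1);

    #pragma acc parallel loop
    for (i = 0; i < n; i++)
        b[i] = 2.0f*a[i];
}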

Kernels Construct

  1. Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.

  2. An important thing to note about the kernels construct is that the compiler will analyze the code and only parallelize it when it is certain that it is safe to do so. In some cases, the compiler may not have enough information at compile time to determine whether a loop is safe to parallelize, in which case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely parallel (an aliasing sketch illustrating this follows the example below).

  3. The kernels construct gives the compiler maximum leeway to parallelize and optimize the code as it sees fit for the target accelerator, but it also relies most heavily on the compiler's ability to automatically parallelize the code.

  4. One more notable benefit of the kernels construct is that if multiple loops access the same data, it will only be copied to the accelerator once, which may result in less data motion.

  5. Programmers with less parallel programming experience or whose code contains a large number of loops that need to be analyzed may find the kernels approach much simpler, as it puts more of the burden on the compiler.

For example:

#pragma acc kernels
{
    for (i=0; i<n; i++)
        a[i] = 3.0f*(float)(i+1);
    for (i=0; i<n; i++)
        b[i] = 2.0f*a[i];
}

 Generates two kernels.

 There is an implicit barrier between the two loops: the second loop will start after the first loop ends.
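
As referenced in point 2 above, a rough sketch of a case where kernels may decline to parallelize (the saxpy function names are invented for illustration): with plain pointers the compiler cannot rule out that x and y overlap, so it may keep the loop sequential; restrict, or an explicit loop independent, supplies the guarantee it is missing.

void saxpy(int n, float a, float *x, float *y)
{
    #pragma acc kernels
    {
        // x and y might alias, so the compiler may refuse to
        // parallelize this loop and run it sequentially instead.
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }
}

void saxpy_independent(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc kernels
    {
        // restrict (and/or loop independent) asserts there is no
        // overlap, so the compiler can generate a parallel kernel.
        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }
}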

Boykins answered 29/12, 2019 at 8:49 Comment(0)
