CUDA inline and noinline device functions

Asked 13/6, 2014 at 14:2 Answered 28/7, 2017 at 12:23

Solved cuda inline device compiler-optimization

According to the documentation, in devices of compute capability 1.x the compiler will inline __device__ functions by default, but for devices of compute capability 2.x and higher it will only do so if deemed appropriate by the compiler. When is it appropriate not to? There are also qualifiers such as __noinline__ and __forceinline__. In which cases is it better not to inline a __device__ function?

Batrachian answered 13/6, 2014 at 14:2 Comment(3)

This post might be answering your question. – Bullen 13/6, 2014 at 14:22

Thanks for that link, but in which cases does it help to explicitly use __noinline__? Does it help, for example, to reduce the register pressure of a very large kernel? – Batrachian 13/6, 2014 at 16:37

In the cases where I have used __noinline__, it was to limit code size and thus reduce excessive compile times. Use of __noinline__ has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions. I have never found a case where use of __noinline__ improved performance, but of course such cases could exist. – Faretheewell 13/6, 2014 at 17:4

The compiler heuristic for inlining presumably weighs the potential performance benefit of inlining, which comes from eliminating function call overhead, against other characteristics, including compile time. Aggressive inlining can lead to very large code that causes very long compile times. From observing the code generated for many different kernels, the CUDA compiler seems to inline in the vast majority of cases. Note that in some cases inlining is currently not possible, for example when the called function is in a different, separately compiled, compilation unit.

In my experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. I have used __noinline__ to limit code size and thus reduce excessive compile times. Use of __noinline__ has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. I have never found a case where use of __noinline__ improved performance, but of course such cases could exist, possibly due to instruction cache effects.
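As a minimal sketch of how the two qualifiers are written, the following defines two device functions with identical bodies that differ only in the inlining qualifier. All names here (`axpy_call`, `axpy_inline`, `apply_both`, `qualifiers_agree`) are made up for illustration. As far as I know, `__noinline__` is only a hint that the compiler may ignore, so this shows the syntax rather than guaranteeing any particular code generation.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: asks the compiler to keep this as a real function call.
__device__ __noinline__ float axpy_call(float a, float x, float y)
{
    return a * x + y;
}

// Hypothetical helper: asks the compiler to inline this at every call site.
__device__ __forceinline__ float axpy_inline(float a, float x, float y)
{
    return a * x + y;
}

__global__ void apply_both(const float* x, float* out_call, float* out_inline, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_call[i]   = axpy_call(2.0f, x[i], 1.0f);
        out_inline[i] = axpy_inline(2.0f, x[i], 1.0f);
    }
}

// Host driver: returns true if both variants produce 2*x + 1 for every element.
bool qualifiers_agree(int n)
{
    std::vector<float> h_x(n), h_call(n), h_inline(n);
    for (int i = 0; i < n; ++i) h_x[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_call = nullptr, *d_inline = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess
           && cudaMalloc(&d_call, bytes) == cudaSuccess
           && cudaMalloc(&d_inline, bytes) == cudaSuccess
           && cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        apply_both<<<blocks, threads>>>(d_x, d_call, d_inline, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_call.data(), d_call, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_inline.data(), d_inline, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // Compare both outputs against the value computed on the host.
    for (int i = 0; ok && i < n; ++i) {
        const float expected = 2.0f * h_x[i] + 1.0f;
        ok = (h_call[i] == expected) && (h_inline[i] == expected);
    }

    cudaFree(d_x);
    cudaFree(d_call);
    cudaFree(d_inline);
    return ok;
}
```

Both variants should compute the same values, so the qualifier only affects how the code is generated. To see what the compiler actually did, one option is to dump the machine code with `cuobjdump -sass` and look for call instructions in the kernel.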

Faretheewell answered 13/6, 2014 at 17:58 Comment(0)

In my experience, forcing a __device__ function call to be inlined can cut runtime in half. In one recent case, I inlined a function call that passed just 5 variables to the function, and kernel execution time decreased from 9.5 ms to 4.5 ms (almost half). If you need to execute the same kernel hundreds of millions of times, with a total runtime of a week or more (as in my case and in many others that work on CFD or MD projects), the increase in compile time is unimportant compared to the large saving in runtime.

All in all, I think it is worth measuring the impact of inlining function calls on runtime, especially for codes with very long runtimes.
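One way to run such a comparison is to time the same kernel with CUDA events, once with a `__noinline__` helper and once with a `__forceinline__` one. The sketch below is only an illustration under my own assumptions: the helper names (`step_call`, `step_inline`, `iterate`, `time_kernel`), the arithmetic in the helper body, and the launch configuration are all hypothetical and are not the code from my project.

```cuda
#include <cuda_runtime.h>

// Hypothetical device helpers with identical bodies; only the inlining qualifier differs.
__device__ __noinline__ float step_call(float v, float a, float b, float c, float d)
{
    return v * a + b * c - d;
}

__device__ __forceinline__ float step_inline(float v, float a, float b, float c, float d)
{
    return v * a + b * c - d;
}

// The template parameter selects which helper the kernel calls in its loop.
template <bool UseInline>
__global__ void iterate(float* data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k) {
        v = UseInline ? step_inline(v, 0.999f, 0.5f, 0.25f, 0.125f)
                      : step_call(v, 0.999f, 0.5f, 0.25f, 0.125f);
    }
    data[i] = v;
}

// Returns the elapsed time of one launch in milliseconds, or -1.0f on error.
template <bool UseInline>
float time_kernel(int n, int iters)
{
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not part of the measurement.
    iterate<UseInline><<<blocks, threads>>>(d_data, n, iters);
    cudaDeviceSynchronize();

    // Timed launch, bracketed by two events on the default stream.
    cudaEventRecord(start);
    iterate<UseInline><<<blocks, threads>>>(d_data, n, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Calling `time_kernel<false>(1 << 20, 1000)` and `time_kernel<true>(1 << 20, 1000)` from `main` and printing both values gives a direct comparison. The numbers depend on the GPU, the compiler version, and the helper body, so a toy helper like this one may show a much smaller difference than a real kernel, or none at all.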

Psychognosis answered 28/7, 2017 at 12:23 Comment(0)
