According to the documentation, in devices of compute capability 1.x the compiler will inline __device__
functions by default, but for devices of compute capability 2.x and higher it will only do so if deemed appropriate by the compiler. When is it appropriate not to? There are also qualifiers such as __noinline__
and __forceinline__
. In which cases is it better not to inline a __device__
function?
The compiler heuristic for inlining presumably evaluates the potential performance benefit from inlining due to the elimination of function call overhead against other characteristics including compile time. Aggressive inlining can lead to very large code that cause very long compile times. From observing the code generated for many different kernels, the CUDA compiler seems to inline in the vast majority of cases. Note that in some cases, inlining is currently not possible, for example when the called function is in a different, separately compiled, compilation unit.
In my experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. I have used __noinline__
to limit code size and thus reduce excessive compile times. Use of __noinline__
has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. I have never found a case where use of __noinline__
improved performance, but of course such cases could exist, possibly due to instruction cache effects.
I've experienced it that if you force __device__
function call to be compiled inline, it can decreases runtime to half. Just in a recent one, I made a function call (which passed just 5 variables to function) inline and kernel execution time decreased from 9.5ms to 4.5ms (almost half). And if you consider that you want to execute the same kernel hundred millions of times with total runtime of a week or more (like my case and many others that work on CFD or MD projects), increase in compile time is nothing important comparing to huge saving in runtime.
All in all, I think it worth to try inline function call impact on runtime especially for codes with very long runtimes.
© 2022 - 2024 — McMap. All rights reserved.
__noinline__
it was used to limit code size and thus reduce excessive compile times. Use of__noinline__
has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions. I have never found a case where use of__noinline__
improved performance, but of course such cases could exist. – Faretheewell