There aren't any hard and fast rules. The CUDA compiler has at least two unrollers, one inside the NVVM or Open64 frontends and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute; a short sketch of that usage follows the two reasons below. The reasons for turning off loop unrolling are twofold:
(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling also tends to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.
(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle loop counts that are not exactly a multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always make the best decision.
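As a minimal sketch of the #pragma unroll 1 usage mentioned above (the kernel, its name, and its parameters are made up purely for illustration), the pragma is placed immediately before the loop it applies to:

```
__global__ void row_sum(const float *in, float *out, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    #pragma unroll 1   // keep this loop rolled to limit register pressure
    for (int c = 0; c < cols; c++) {
        acc += in[row * cols + c];
    }
    out[row] = acc;
}
```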
In rare cases I have found that manually specifying a higher unrolling factor than the compiler chose automatically has a small beneficial effect on performance (with typical gains in the single-digit percent range). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight compute-bound loops that benefit from minimizing the loop overhead.
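A sketch of what that can look like in a memory-intensive loop (the kernel and the factor of 8 are assumptions for illustration; in practice one would sweep a few factors and profile):

```
__global__ void grid_stride_sum(const float *in, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float sum = 0.0f;
    #pragma unroll 8   // request a larger unrolling factor than the compiler default
    for (int i = tid; i < n; i += stride) {
        sum += in[i];
    }
    out[tid] = sum;   // per-thread partial sum; final reduction happens elsewhere
}
```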
Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.