There aren't any hard and fast rules. The CUDA compiler has at least two unrollers, one inside the NVVM or Open64 frontends and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute; a short sketch of that usage follows the two reasons below. The reasons for turning off loop unrolling are twofold:
(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling also tends to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.
(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle loop counts that are not exactly a multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always make the best decision.
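As a minimal sketch of the #pragma unroll 1 usage mentioned above (the kernel, its name, and its parameters are made up purely for illustration), the pragma is placed immediately before the loop it applies to:

```
__global__ void row_sum(const float *in, float *out, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    #pragma unroll 1   // keep this loop rolled to limit register pressure
    for (int c = 0; c < cols; c++) {
        acc += in[row * cols + c];
    }
    out[row] = acc;
}
```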
In rare cases I have found that manually specifying a higher unrolling factor than the compiler chose automatically has a small beneficial effect on performance (with typical gains in the single-digit percent range). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight compute-bound loops that benefit from minimizing the loop overhead.
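A sketch of what that can look like in a memory-intensive loop (the kernel and the factor of 8 are assumptions for illustration; in practice one would sweep a few factors and profile):

```
__global__ void grid_stride_sum(const float *in, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float sum = 0.0f;
    #pragma unroll 8   // request a larger unrolling factor than the compiler default
    for (int i = tid; i < n; i += stride) {
        sum += in[i];
    }
    out[tid] = sum;   // per-thread partial sum; final reduction happens elsewhere
}
```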
Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.