How do optimizing compilers decide when and how much to unroll a loop?
When a compiler performs a loop-unroll optimization, how does it determine the factor by which to unroll the loop, or whether to unroll the loop entirely? Since this is a space-performance trade-off, on average how effective is this optimization technique at making the program perform better? Also, under what conditions is it recommended to use this technique (e.g. certain operations or calculations)?

This doesn't have to be specific to a certain compiler. It can be any explanation outlining the idea behind this technique and what has been observed in practice.

Anthracnose answered 7/10, 2011 at 18:12 Comment(9)
Are you looking for a paper on compiler optimization analysis? :)Calfee
I'd like to add: why gcc's help message says -funroll-all-loops actually makes the program run slower? Quoting: "Perform the optimization of loop unrolling. This is done for all loops and usually makes programs run more slowly."Elfie
@Jon, it doesn't matter, I just need a good answer.Anthracnose
I don't think this is a question that can be answered in this format. What you're looking for is in-depth analysis that would have to be some kind of academic paper with a lot of empirical data behind it.Bilingual
@Radu: the clauses beginning "how does it" and "under what conditions" are questions.Bk
A good heuristic is that loop unrolling is probably a bad idea if taken beyond the degree where instruction interleaving to avoid data dependency stalls and/or vectorization come into play.Wensleydale
@R.. D: you could make an answer with that, but try to make it more clear ;)Elfie
@BlackBear: The assumption behind this statement is that branch prediction is pretty good these days, so the branch is not all that expensive (especially true for loops). On the other hand, level-1 cache is still comparatively small and will remain so (it is hard to squeeze a lot more transistors that close to the CPU just because they won't fit). Cache misses are expensive. Which means larger code is quite possibly slower code due to cache thrashing. Also, larger code increases the likelihood of other things, such as page faults.Involucre
@Steve: "how does it determined by which factor to unroll the loop or weather to unroll the whole loop or not" may be a question in a sense, but it's barely comprehensible.Ottilie

When a compiler performs a loop unroll optimization, how does it determine the factor by which to unroll the loop, or whether to unroll the whole loop or not?

Stack consumption and locality. Instruction counts. The ability to make/propagate optimizations based on the unrolled and inlined program. Whether the loop size is fixed, or expected to fall in a certain range. Profile inputs (if applicable). Operations which may be removed from the loop body. Etc.

Since this is a space-performance trade-off, on average how effective is this optimization technique at making the program perform better?

It depends largely on the input (your program). It can be slower (not typical) or it can be several times faster. Writing a program that runs optimally and also enables the optimizer to do its job is a learned skill.

Also, under what conditions is it recommended to use this technique (e.g. certain operations or calculations)?

Generally: a large number of iterations over a very small body, particularly one that is branchless and has good data locality.

If you want to know whether the option helps your app, profile.

If you need more than that, you should reserve some time to learn how to write optimal programs, since the subject is quite complex.

Comportment answered 7/10, 2011 at 18:47 Comment(4)
do you have any recommendations for resources about writing optimal programs?Anthracnose
it really depends on your current knowledge level and the programs you write... perhaps you'll find this a good resource: agner.org/optimizeComportment
+1 For the link Justin. Found this bit on the MASM forums to be amusingly harsh: "Not for the faint of heart. If MASM is beyond you, take up server side scripting."Balm
@John yeah - Agner's manuals provide a lot of good information. I've written quite a few performance-critical programs, and have rarely resorted to writing assembly. I'll typically resort to metaprogramming implementations instead (I like portability). Thanks for the quote :)Comportment

The simplistic analysis is to count instructions: a 2-instruction loop unrolled 10 times executes 11 instructions per 10 iterations' worth of work instead of 20, i.e. roughly 11/20 of the instruction count. But with modern processor architectures it's much more complex, depending on cache sizes and the characteristics of the processor's instruction pipeline. It's possible that the above example would run 10x faster instead of 2x. It's also possible that unrolling 1000x instead of 10x would run slower. Without targeting a specific processor, compilers (or the pragmas you write for them) are just guessing.

Plash answered 7/10, 2011 at 18:32 Comment(0)
M
2

Ok, first of all, I don't know how compilers do this automatically, and I'm pretty sure there are at least tens if not hundreds of heuristics they have to choose from.
And it's probably compiler-specific anyway.

But, I can help you with calculating its effectiveness.

Just note that this technique usually doesn't give you a great performance boost, because the work inside the loop body usually takes much more computation time than the loop's condition checking.
It pays off mainly in tight, repeated calculations where the per-iteration overhead is a significant fraction of the work.

So, let's say we have a simple loop with a constant bound, because you were too lazy to copy-paste or just thought it would look better:

for (int i = 0; i < 5; i++)
{
    DoSomething();
}

Here you have 5 int comparisons, 5 incrementations, and 5 DoSomething() calls.
So if DoSomething() is relatively quick, that's 15 operations.
Now if you unroll this, you reduce it to just 5 operations:

DoSomething();
DoSomething();
DoSomething();
DoSomething();
DoSomething();

Now with constants it's easier, so lets see how it would work with a variable:

for (int i = 0; i < n; i++)
{
    DoSomething();
}

Here you have n int comparisons, n incrementations, and n DoSomething() calls = 3n operations. Now, we can't unroll it entirely, but we can unroll it by a constant factor (the higher n is expected to be, the more we should unroll it):

int i;
for (i = 0; i + 3 <= n; i += 3) // stop while at least 3 full iterations remain
{
    DoSomething();
    DoSomething();
    DoSomething();
}
if (n - i == 1)
{
    DoSomething(); // one iteration left over
}
else if (n - i == 2)
{
    DoSomething(); // two iterations left over
    DoSomething();
}

Now here we have roughly n/3 + 2 int comparisons, n/3 incrementations, and n DoSomething() calls ≈ (1 2/3)·n operations.
We saved ourselves about (1 1/3)·n operations, which cuts the operation count almost in half (assuming DoSomething() itself is cheap).

FYI, another neat unrolling technique is called Duff's device.
But it's very compiler- and language-implementation-specific. There are languages where this would actually be worse.

Mecham answered 7/10, 2011 at 18:36 Comment(0)

When it is (in my opinion) good to unroll a loop:

The loop is short and possibly all the variables used are in processor registers. After unrolling, the variables are 'duplicated' but still live in registers, so there is no memory (or cache) penalty.

The loop (with an unknown trip count) will be executed at least a few or a dozen times, so there's justification for loading the whole unrolled loop into the instruction cache.

If the loop body is short (one or just a few instructions), unrolling can be very beneficial, because the code that determines whether to execute again runs less often.

Ejection answered 7/10, 2011 at 18:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.