Why does .NET Native compile loop in reverse order?

I'm working on optimization techniques performed by the .NET Native compiler. I've created a sample loop:

        for (int i = 0; i < 100; i++)
        {
            Function();
        }

And I've compiled it with .NET Native. Then I disassembled the resulting .dll file (which contains the machine code) in IDA. As a result, I have:

IDA output

(I've removed a few unnecessary lines, so don't worry that the address lines look inconsistent.)

I understand that add esi, 0FFFFFFFFh effectively subtracts one from esi and updates the Zero Flag, so we can jump back to the beginning if zero hasn't been reached yet.
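(To convince myself of the "subtract one" part I ran a quick C# check; this is just my own sanity test, not taken from the disassembly:)

    // 0xFFFFFFFF reinterpreted as a signed 32-bit integer is -1,
    // so "add esi, 0FFFFFFFFh" is effectively "sub esi, 1".
    System.Console.WriteLine(unchecked((int)0xFFFFFFFF));   // prints -1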

What I don't understand is why the compiler reversed the loop.

I came to the conclusion that

LOOP:
add esi, 0FFFFFFFFh   ; esi = esi - 1 (sets ZF when esi reaches 0)
jnz LOOP              ; keep looping while esi != 0

is just faster than for example

LOOP:
inc esi               ; esi = esi + 1
cmp esi, 064h         ; compare against 100
jl LOOP               ; keep looping while esi < 100

But is it really because of that, and is the speed difference really significant?
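
For what it's worth, I think the reversed loop corresponds to this count-down rewrite back in C# (my own reconstruction, not decompiled output; it's only equivalent because the body never reads i):

        // Count-down form: the termination test falls out of the subtraction
        // itself (zero flag), so no separate comparison against 100 is needed.
        for (int i = 100; i != 0; i--)
        {
            Function();
        }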

Coth answered 5/4, 2016 at 15:11 Comment(7)
ADD with an immediate value is faster than INC, and you also skip CMP... all of this in 3 lines of code. So yes, the difference is REALLY significant (both in size and in speed). Imagine doing this in ~30000 places in a real-world program...Consentient
Yes, it's faster, and in general optimizers will apply any optimization they can that makes your code faster without changing the semantics of your program.Edson
Regarding the inverted direction, perhaps comparison to zero is quicker than comparison to a specific value?Intuitive
Yes, because you don't even need a comparison. As you can (not) see :)Balneology
You've written the code both ways. If you want to know whether one way is faster than the other, then run them.Inserted
@EricLippert I'm not lazy and I would love to, but I'm on my work PC right now and I don't have any tools to run or benchmark assembly code :( Also I don't have administrator rights to install anything.Coth
A for(;;) loop normally also requires checking that the start value satisfies the end condition. Typical codegen is a branch forward to the condition code and then a branch back to the loop body code. But the optimizer can take a shortcut here: it knows that the start value already passes the test and that you don't use the value of i anywhere, so it can generate less code (roughly as in the sketch below).Saccharide
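
A rough C# sketch of the two shapes described in the last comment above (a hypothetical illustration written with goto so the branch structure is visible; it is not actual .NET Native output):

    static class LoopShapes
    {
        static void Function() { }

        // Typical lowering of: for (int i = 0; i < 100; i++) Function();
        static void TypicalLowering()
        {
            int i = 0;
            goto Test;            // branch forward to the condition code
        Body:
            Function();
            i++;
        Test:
            if (i < 100)
                goto Body;        // branch back to the loop body
        }

        // Shortcut: the compiler knows 0 < 100 holds on entry and that i is
        // never read, so it can drop the entry test and count down instead.
        static void Shortcut()
        {
            int n = 100;
        Body:
            Function();
            if (--n != 0)
                goto Body;
        }

        static void Main()
        {
            TypicalLowering();
            Shortcut();
        }
    }

Both methods behave the same; the second simply has no entry test and one instruction less per iteration.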

inc might be slower than add because of its partial flag update. Moreover, add sets the zero flag, so you don't need a separate cmp instruction; you can jump directly.

This is a well-known type of loop optimization:

Loop reversal: Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependencies and thus enable other optimizations. Also, certain architectures utilize looping constructs at the assembly language level that count in a single direction only (e.g. decrement-jump-if-not-zero (DJNZ)).
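
To make the "subtle" part concrete, here is a hypothetical C# illustration (my own example, not from the quoted text): reversal is only legal when the result doesn't depend on the order in which the index values are visited.

    // Safe to reverse: the body never observes i, so visiting 99..0
    // instead of 0..99 changes nothing observable.
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += 5;                          // independent of i

    // Not safe to reverse as-is: the printed sequence 0..99 would become 99..0.
    for (int i = 0; i < 100; i++)
        System.Console.WriteLine(i);

    System.Console.WriteLine(sum);         // 500 either way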

You can see the result for other compilers here.

Midsummer answered 5/4, 2016 at 15:18 Comment(4)
inc is slower than add by one clock cycle. Compare them in the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Scroll down to Appendix C and you can see the latency and throughput of each x86/x64 instruction. One clock cycle may not seem significant, but if you have hundreds or thousands of loops, it will add up fast.Conduction
@Conduction Those numbers don't reflect reality on the microarchitectures they describe (IvyBridge through Skylake; see the table earlier in that appendix). A dec/jnz loop can run at 1 iteration per cycle, and inc/dec have only 1 cycle latency for the integer register as part of other dep chains. Possibly Intel got the 2-cycle latency on IvyBridge through Broadwell (but not Skylake) from looking at latency to read EFLAGS, maybe including CF which would require a flag merge. But that's not a problem for dec / jnz even without fusion, or dec / setz. I only have a Skylake so I can't test :/Kristakristal
@Icemanind: Also note that those were latency numbers; the table you reference still listed inc/dec throughput at 0.25 cycles, i.e. 4-per-clock. Anyway, Agner Fog's instruction tables, based on experimental testing, lists inc/dec at 1c latency / 0.25c throughput. So does uops.info/table.html. uops.info even measured latency from input to integer output and to flag output, and found 1 cycle in both cases: uops.info/html-instr/INC_R32.html. (Not including the CF non-output)Kristakristal
@Icemanind: avoiding dec here is only useful if the code might run on Silvermont or Pentium 4. Otherwise it's a waste of code-size for mainstream Intel and AMD.Kristakristal

Your conclusion is correct: the inverted loop counts down towards 0 (it ends when the register value reaches 0), so the add sets the zero flag used by the conditional branch.

This way you don't need a dedicated cmp, which gives you: 1) smaller code, and 2) faster code (judging both from the compiler writers' decision and from the other answer).

Writing loops that count down towards 0 is a pretty common assembly trick. I'm surprised that you can read the assembly but haven't come across it.

Behnken answered 5/4, 2016 at 15:49 Comment(0)
