x86_64 - Assembly - loop conditions and out of order
I am not asking for a benchmark.

(If that was the case, I would have done it myself.)


My question:

I tend to avoid the indirect/index addressing modes for convenience.

As a replacement, I often use immediate, absolute or register addressing.

The code:

# %esi holds the array address. Say we iterate over a doubleword (4-byte) array.
# %ecx holds the array element count.
myloop:                 # (located at address 0x98767)
    ...                 # do whatever with %esi
    add $4, %esi
    dec %ecx
    jnz myloop

Here we have a serialized pair (dec and jnz), which prevents proper out-of-order execution because jnz depends on the result of dec.

Is there a way to avoid that / break the dependency? (I am not an assembly expert.)

Burson answered 2/8, 2015 at 11:38 Comment(12)
So let me get this straight: you want a conditional jump, which depends on the outcome of the previous instruction, to be executable out-of-order with that instruction? I think this is logically impossible. - Rhines
Also note dec is not recommended because it causes partial flags update stall. - Forb
@Jester: I should use a sub then? - Burson
@davmac: my goal is to not depend on the previous instruction. - Burson
@Burson do you mean you want to re-order the dec and the add? In that case can you not use jcxz? (You can't make a conditional jump not-dependent on the instruction which produces the condition). - Rhines
You can use lea 4(%esi),%esi for the addition and that doesn't affect flags, so you can insert a subl $1, %ecx higher up. As @Rhines says, you can't get rid of the dependency unless you use the loop instruction which is again not recommended. - Forb
@Jester: :) thanks for the lea tip. - Burson
@davmac: yes I don't feel obliged to use the conditional jumps if there is a better solution. - Burson
Also be sure to unroll the loop if possible, to amortize the cost of the loop overhead. - Forb
@Jester: absolutely, but the length is variable. Nice tip though. Must take care of the cache line length. - Burson
@davmac: I wouldn't recommend jcxz, unless that lets you avoid a test or cmp instruction. On Intel CPUs, it's a 2-uop instruction. (Less of a big deal when code is in the uop cache, otherwise it can slow down decoding because it can only be handled by the complex decoder.) - Inadvertent
@jester: dec is fine when it macro-fuses with the following branch (on Intel CPUs.) AMD CPUs also avoid partial-flag stalls by treating separate bits of the flags as independent. (I haven't benchmarked AMD, or the non-macro-fused case on Intel, though.) - Inadvertent
I
13

When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction (if it's one of the simple ones listed in the table below), so they can macro-fuse into one uop in the decoders.
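As a minimal sketch of that placement (not from the question: the function name sum_u32, the System V x86-64 calling convention and the 64-bit registers are assumptions made for illustration), a counted loop over a doubleword array could end like this:

```asm
# Hypothetical: uint32_t sum_u32(const uint32_t *p, size_t n)
# Assumed System V x86-64 ABI: %rdi = p, %rsi = n, result in %eax.
    .text
    .globl  sum_u32
sum_u32:
    xorl    %eax, %eax          # sum = 0
    testq   %rsi, %rsi
    jz      2f                  # n == 0: nothing to add
1:
    addl    (%rdi), %eax        # sum += *p
    addq    $4, %rdi            # advance to the next doubleword
    decq    %rsi                # flag-setting instruction goes last ...
    jnz     1b                  # ... directly before the branch that reads ZF
2:
    ret
```

The add that advances the pointer also writes flags, but dec overwrites ZF afterwards, so the order add, dec, jnz is fine and nothing needs to sit between dec and jnz.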

Doing this is not significantly worse for older CPUs that don't do macro-fusion. Putting the flag-setting instruction earlier might shorten the branch mispredict penalty by one cycle on such CPUs, but with out-of-order execution, moving the dec a couple of instructions earlier won't make a real difference. See also "Avoid stalling pipeline by calculating conditional early".

To really make a difference, you have to do things like unroll the loop and/or branch on something that can be calculated more simply, ideally without a dependency on a slow input, so OoO exec can have the branch already resolved while it is still working on older iterations of the loop body. In other words, the loop-counter dep chain can run ahead of the main work.
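To illustrate what branching on something without a slow input could look like, here is a hedged sketch that walks a linked list whose length is already known. The node layout, the function name sum_list and the calling convention are all hypothetical; the point is only that the branch reads a counter, not the pointer that was just loaded.

```asm
# Hypothetical: uint64_t sum_list(const struct node *head, size_t n)
# Assumed layout: struct node { struct node *next; uint64_t val; };
# Assumed System V x86-64 ABI: %rdi = head, %rsi = n (node count, known up front).
    .text
    .globl  sum_list
sum_list:
    xorl    %eax, %eax          # sum = 0 (also zeroes the upper half of %rax)
    testq   %rsi, %rsi
    jz      2f                  # empty list
1:
    addq    8(%rdi), %rax       # sum += node->val
    movq    (%rdi), %rdi        # node = node->next: the slow, load-latency-bound chain
    decq    %rsi                # counter chain: depends only on its own previous value
    jnz     1b                  # branch outcome never waits for the pointer load
2:
    ret
```

Had the loop instead tested the loaded next pointer against zero, each branch could only resolve after its load completed. With the separate counter, the dec/jnz chain can run several iterations ahead of the loads.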

I don't have benchmarks, but I don't think the small downside on increasingly-rare CPUs justifies missing out on the front-end throughput benefit (decode and issue) for CPUs that do fusion. Total uop throughput can often be a bottleneck.

AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any other ALU instructions. So definitely put compares right before their branches. It's still valuable for Intel CPUs to put other flag-setting instructions right before branches if they can macro-fuse on Sandybridge-family.
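A loop that ends in cmp/jcc is a candidate for fusion on both vendors' CPUs. One way to get there, sketched under the same assumptions as before (hypothetical function name sum_u32_end, System V x86-64 calling convention), is to compare the running pointer against an end pointer instead of counting down:

```asm
# Hypothetical: uint32_t sum_u32_end(const uint32_t *p, size_t n)
# Assumed System V x86-64 ABI: %rdi = p, %rsi = n, result in %eax.
    .text
    .globl  sum_u32_end
sum_u32_end:
    xorl    %eax, %eax              # sum = 0
    leaq    (%rdi,%rsi,4), %rdx     # end = p + n (one past the last element)
    cmpq    %rdx, %rdi
    je      2f                      # p == end: empty array
1:
    addl    (%rdi), %eax            # sum += *p
    addq    $4, %rdi                # p++
    cmpq    %rdx, %rdi              # compare goes last ...
    jne     1b                      # ... right before its branch
2:
    ret
```

This also removes the separate counter decrement from the loop body, at the cost of one lea before the loop.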

From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):

First instruction | Can pair with these (and the inverse) | Cannot pair with
------------------+---------------------------------------+-----------------------
cmp               | jz, jc, jb, ja, jl, jg                | js, jp, jo
add, sub          | jz, jc, jb, ja, jl, jg                | js, jp, jo
adc, sbb          | none                                  |
inc, dec          | jz, jl, jg                            | jc, jb, ja, js, jp, jo
test              | all                                   |
and               | all                                   |
or, xor, not, neg | none                                  |
shift, rotate     | none                                  |

Table 9.2. Instruction fusion

So basically, inc/dec can macro-fuse with a jcc as long as the condition doesn't read CF, the one flag that inc/dec leave unmodified. (js, jp and jo are a separate case: per the table they only fuse with test and and.)

(Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags, like the merging uop you get when you read eax after writing al. On earlier CPUs, you get a partial-flags stall instead.)
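If a loop really wants a carry-based condition, the table suggests using sub rather than dec. A minimal sketch, with a hypothetical function zero_u32 and the System V x86-64 calling convention assumed, that counts an index down through zero and exits on the borrow:

```asm
# Hypothetical: void zero_u32(uint32_t *p, size_t n)  -- clears p[n-1] down to p[0]
# Assumed System V x86-64 ABI: %rdi = p, %rsi = n.
    .text
    .globl  zero_u32
zero_u32:
    subq    $1, %rsi                # i = n - 1; sets CF if n was 0
    jc      2f                      # n == 0: nothing to clear
1:
    movl    $0, (%rdi,%rsi,4)       # p[i] = 0
    subq    $1, %rsi                # i--; sets CF once i wraps below 0
    jnc     1b                      # sub + jnc (inverse of jc) is a listed pair
2:
    ret
```

Replacing the sub with dec here would not just lose fusion, it would be a bug: dec leaves CF unmodified, so jnc would be testing a stale carry from an earlier instruction.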

Core2 / Nehalem were more limited in macro-fusion capability (only CMP/TEST, with a more limited set of JCC conditions), and Core2 couldn't macro-fuse in 64-bit mode at all.

Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.

Inadvertent answered 3/8, 2015 at 1:53 Comment(4)
Thanks a lot Peter, I already had the "instruction tables" and "optimize assembly" from him. I didn't read the latter completely though (hence my ignorance), BUT I WILL do it now. Thanks Peter :) - Burson
@Kroma: This table is from the microarchitecture.pdf. I forget if he mentions macro-fusion in the optimize asm guide, but he probably at least mentions it. - Inadvertent
Probably worth mentioning that the first instruction and second instruction have to be in the same 16-byte decoding segment for this to work (have the same address rounded down to 16 bytes). I think Agner Fog mentions that somewhere. - Infusionism
@Noah: first, decode groups aren't always aligned. Second, Sandybridge-family hangs on to the last instruction in a group if it's a fusion candidate, in case the first instruction in the next group is a branch. (So it sacrifices some legacy-decode throughput to maybe build more compact uop-cache lines, and to maybe minimize ROB space and other back-end resource consumption this time through). I think this has been discussed on SO somewhere, but nothing specific comes to mind. Still, you might find something with Google. - Inadvertent
