When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction (if it's one of the simple ones listed in the table below), so they can macro-fuse into one uop in the decoders.
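For example (AT&T syntax; a sketch with made-up registers, not code from the question), the cmp/jl pair can only fuse when nothing separates them:

    # can macro-fuse: the flag-setter is immediately before the jcc
    cmpl  %edx, %eax
    jl    .Lsmaller

    # can't macro-fuse: an unrelated instruction sits between them
    cmpl  %edx, %eax
    movl  (%esi), %ebx
    jl    .Lsmaller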
Doing this is not significantly worse for older CPUs that don't do macro-fusion. Putting the flag-setting earlier might shorten the branch mispredict penalty by one cycle on such CPUs, but out-of-order execution means that moving the dec a couple of instructions earlier won't make a real difference. See also Avoid stalling pipeline by calculating conditional early. To really make a difference, unroll the loop and/or branch on something that can be calculated more simply, ideally without a dependency on a slow input, so out-of-order exec can have the branch already resolved while working on older iterations of the loop body. That is, the loop-counter dependency chain can run ahead of the main work.
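As a sketch of that idea (a hypothetical sum-over-an-array loop, AT&T syntax, registers chosen for illustration): branch on a cheap counter dependency chain rather than on anything derived from the loaded data, so the branches can resolve many iterations ahead of the loads:

    movl  $1024, %ecx          # trip count: a cheap, self-contained dep chain
    .Lsum:
    addl  (%esi), %eax         # main work: depends on (possibly cache-missing) loads
    addl  $4, %esi
    decl  %ecx                 # counter only depends on the previous decl
    jnz   .Lsum                # macro-fuses with the decl, and can resolve
                               # far ahead of the slow loads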
I don't have benchmarks, but I don't think the small downside on increasingly-rare CPUs justifies missing out on the front-end throughput benefit (decode and issue) for CPUs that do fusion. Total uop throughput can often be a bottleneck.
AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any other ALU instructions. So definitely keep compares next to their branches. On Intel it's still valuable to keep other flag-setting instructions next to their branches, since they can macro-fuse on Sandybridge-family.
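A sketch of that difference (made-up registers): a test/jz pair can fuse on both families, while an and/jz pair only fuses on Intel:

    # fuses on AMD Bulldozer-family and on Intel Sandybridge-family
    testl %eax, %eax
    jz    .Lzero

    # fuses on Sandybridge-family only; Bulldozer-family fuses just test/cmp
    andl  $7, %eax
    jz    .Lmultiple_of_8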
From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):
First             | can pair with these    | cannot pair with
instruction       | (and the inverse)      |
------------------+------------------------+------------------------
cmp               | jz, jc, jb, ja, jl, jg | js, jp, jo
add, sub          | jz, jc, jb, ja, jl, jg | js, jp, jo
adc, sbb          | none                   |
inc, dec          | jz, jl, jg             | jc, jb, ja, js, jp, jo
test              | all                    |
and               | all                    |
or, xor, not, neg | none                   |
shift, rotate     | none                   |

Table 9.2. Instruction fusion
So basically, inc/dec can macro-fuse with a jcc as long as the condition only depends on bits that are modified by inc/dec.
(Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags (like when you read eax after writing al). Or on earlier CPUs, a partial-flags stall.)
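For instance (a sketch, not from the original code): jnz reads only ZF, which dec writes, but ja also reads CF, which dec leaves untouched:

    # macro-fuses: jnz only reads ZF, which decl writes
    decl  %ecx
    jnz   .Ltop

    # doesn't macro-fuse: ja reads CF as well as ZF, and decl doesn't write CF,
    # so the flags have to be combined as described above
    decl  %ecx
    ja    .Ltop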
Core2 / Nehalem were more limited in macro-fusion capability (only CMP/TEST, with more limited jcc combinations), and Core2 couldn't macro-fuse in 64-bit mode at all.
Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.
Comments:

dec is not recommended because it causes partial flags update stall. – Forb

dec and the add? In that case can you not use jcxz? (You can't make a conditional jump not-dependent on the instruction which produces the condition). – Rhines

lea 4(%esi),%esi for the addition and that doesn't affect flags, so you can insert a subl $1, %ecx higher up. As @Rhines says, you can't get rid of the dependency unless you use the loop instruction, which is again not recommended. – Forb

jcxz, unless that lets you avoid a test or cmp instruction. On Intel CPUs, it's a 2-uop instruction. (Less of a big deal when code is in the uop cache, otherwise it can slow down decoding because it can only be handled by the complex decoder.) – Inadvertent

dec is fine when it macro-fuses with the following branch (on Intel CPUs). AMD CPUs also avoid partial-flag stalls by treating separate bits of the flags as independent. (I haven't benchmarked AMD, or the non-macro-fused case on Intel, though.) – Inadvertent
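A minimal sketch of the lea trick from those comments (hypothetical loop; the surrounding instructions are my own illustration): lea doesn't write flags, so the flag-setting subl can be hoisted above it and its result still reaches the branch:

    .Ltop:
    subl  $1, %ecx             # set ZF early, on the cheap counter dep chain
    lea   4(%esi), %esi        # advance the pointer without touching flags
    movl  (%esi), %eax         # plain loads don't write flags either
    jnz   .Ltop                # still tests the subl's result, though it can
                               # no longer macro-fuse with the subl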