Micro fusion and addressing modes
I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).

The following instruction using [base+index] addressing

addps xmm1, xmmword ptr [rsi+rax*1]

does not micro-fuse according to IACA. However, if I use [base+offset] like this

addps xmm1, xmmword ptr [rsi]

IACA reports that it does fuse.

Section 2-11 of the Intel optimization reference manual gives the following as an example "of micro-fused micro-ops that can be handled by all decoders"

FADD DOUBLE PTR [RDI + RSI*8]

and Agner Fog's optimization assembly manual also gives examples of micro-op fusion using [base+index] addressing. See, for example, Section 12.2 "Same example on Core2". So what's the correct answer?

Moonlit answered 25/9, 2014 at 19:33 Comment(16)
Downvoter please explain yourself. Not all of us have time to test everything through experiment.Moonlit
@IwillnotexistIdonotexist, I am trying to write tests to check this. Currently I have a case where IACA says the fused version has a block throughput of 2.0 and the non-fused version 6.0 but they both take the same time in practice. I am leaning towards the side that IACA has a bug. But if you find something please let me know.Moonlit
@IwillnotexistIdonotexist, did you get a chance to look into this? I can't give out 500 point bounties every day :-/Moonlit
I genuinely don't know; I've been quite stumped on this problem the past few days although somebody dropped this useful Haswell diagram below your older question's answer. That fills my sails slightly - Micro/macrofusion happens at decode time and the ROB can't assist.Crofoot
@IwillnotexistIdonotexist, that's a cool diagram! Thanks! Maybe I should just post a message on IACA forums about this.Moonlit
I'm grasping at straws - the section you quote out of the Intel optimization manual is under "Sandy Bridge". Did you try running IACA with the flag -arch SNB for the example instructions, and addps xmm1, xmmword ptr [rsi+rax*1]?Crofoot
For kicks I tried having IACA analyze the examples that Intel alleges will microfuse, but it turns out IACA claims fadd st0, qword ptr [rdi+rsi*8] does not microfuse, whether alone or unrolled 20 times. Don't know what to make of this. EDIT: That goes for all architectures: NHM, WSM, SNB, IVB and HSW.Crofoot
For that matter, ret also is claimed to microfuse but doesn't according to IACA, whereas jmp [rdi+200] does indeed microfuse.Crofoot
And shockingly an instruction claimed not to microfuse (cmp dword ptr [rip-0x43], 0x1b) does microfuse according to IACA on both SNB and HSW! I think there's something seriously wrong in either the manual or IACA, and our next step is to experimentally determine who is right (IACA or the manual).Crofoot
@IwillnotexistIdonotexist, yeah we need an experiment. My triad function is no good as it is now because on Core2-IB it needs 2 cycles with or without micro-op fusion anyway. On Haswell we already have an experiment to show that the fusion is simple on port 7 and if we fix the triad function to use port 7 it needs a compare which means port 6 takes two cycles. So some modification to the triad function is necessary or a new test altogether.Moonlit
@IwillnotexistIdonotexist: the Intel manuals were probably written before SnB. Sandybridge switched to a physical register file and made major under-the-hood changes to how uops are tracked. This came up in a discussion recently: stackoverflow.com/questions/31875464/…. Perf-counter experiments on SnB show that IACA is right. (except for rip-relative, glad you brought that up). I'm still waiting to hear if Skylake changed anything on this front.Opt
@PeterCordes, I tested Nehalem as well. It does not fuse either using two registers. This problem goes back further than SNB. Though IwillnotexistIdonotexist already noted that "That goes for all architectures: NHM, WSM, SNB, IVB and HSW". So I guess Intel's manual was written before Nehalem even.Moonlit
@Zboson: Are you sure about Nehalem? In Agner Fog's answer on this question, he says that older Intel CPUs without a uop cache can do the fusion. Maybe Intel changed the internal uop format for Nehalem's 28uop loop buffer? IACA does show it not fusing on NHM. You tested with actual perf counters, though?Opt
@PeterCordes, I only used IACA. I did not do any tests. Good point. I am assuming that IACA is right. Do you have proof otherwise (did I miss this in your answer)? My triad function on NHM - IVB needs at least two cycles due to the loads/stores on the same port so not-fusing is not an issue. It only matters since HSW (I resubmitted this comment due to some errors).Moonlit
@Zboson: For Nehalem, no. I only personally tested uops with perf counters on SnB. IACA is known to be unreliable, so I wouldn't trust it in the face of other evidence: Agner Fog's statement, and the fact that Sandybridge was when Intel made major changes to the internals (including the uop format IIRC what I read). SnB is generally considered the point at which P6 evolved into a new species of microarchitecture.Opt
Regarding the initial downvote, there appears to be a crop of militants on SO who summarily downvote any/everything that could be perceived as being related to micro-optimization. What they perhaps neglect to understand is that, despite the inherent value and importance of such study, it can also be fun.Tranquillize

In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an instruction with an immediate operand can't micro-fuse a RIP-relative addressing mode).

But some combinations of uop and addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop count after un-lamination is what matters.

Intel's optimization manual describes un-lamination for Sandybridge in Section E.2.2.4: Micro-op Queue and the Loop Stream Detector (LSD), but doesn't describe the changes for any later microarchitectures.

UPDATE: Intel's manual now has a detailed section describing un-lamination for Haswell (Section E.1.5, Unlamination), and a brief description for Sandybridge in Section E.2.2.4.


The rules, as best I can tell from experiments on SnB, HSW, and SKL:

  • SnB (and I assume also IvB): indexed addressing modes are always un-laminated, others stay micro-fused. IACA is (mostly?) correct.
  • HSW, SKL: These only keep an indexed ALU instruction micro-fused if it has 2 operands and treats the dst register as read-modify-write. Here "operands" includes flags, meaning that adc and cmov don't micro-fuse. Most VEX-encoded instructions also don't fuse, since they generally have three operands (so paddb xmm0, [rdi+rbx] fuses but vpaddb xmm0, xmm0, [rdi+rbx] doesn't). Finally, the occasional 2-operand instruction whose first operand is write-only, such as pabsb xmm0, [rax + rbx], also doesn't fuse. IACA is wrong, applying the SnB rules. (See the sketch after this list.)
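
These cases in NASM syntax (fusion behavior exactly as stated in the rule above, not re-measured here):

paddb  xmm0, [rdi+rbx]         ; stays micro-fused: 2 operands, xmm0 is read-modify-write
add    eax,  [rsp+rsi]         ; stays micro-fused for the same reason
vpaddb xmm0, xmm0, [rdi+rbx]   ; un-laminated: 3-operand VEX encoding
adc    eax,  [rdi+rsi]         ; un-laminated: the CF input counts as an extra operand
pabsb  xmm0, [rax+rbx]         ; un-laminated: the destination is write-only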

Related: simple (non-indexed) addressing modes are the only ones that the dedicated store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a single register, but src with dst+(initial_src-initial_dst). Then you only have to increment the dst register inside a loop.)
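
A minimal sketch of that trick (NASM syntax; src, dst, and COUNT are placeholder names of my own, and the loop body is just an example stream operation):

    lea    rsi, [rel src]
    lea    rdi, [rel dst]
    sub    rsi, rdi          ; rsi = initial_src - initial_dst, a loop-invariant offset
    mov    ecx, COUNT
ALIGN 32
.loop:
    movaps xmm0, [rdi+rsi]   ; the load reaches src via dst + offset
    addps  xmm0, xmm1
    movaps [rdi], xmm0       ; one-register store: the store-address uop can run on port 7
    add    rdi, 16           ; only the dst pointer needs incrementing
    dec    ecx
    jg     .loop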

Note that some instructions never micro-fuse at all (even in the decoders/uop-cache). e.g. shufps xmm, [mem], imm8, or vinsertf128 ymm, ymm, [mem], imm8, are always 2 uops on SnB through Skylake, even though their register-source versions are only 1 uop. This is typical for instructions with an imm8 control operand plus the usual dest/src1, src2 register/memory operands, but there are a few other cases. e.g. PSRLW/D/Q xmm,[mem] (vector shift count from a memory operand) doesn't micro-fuse, and neither does PMULLD.

See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL when you read lots of registers: Lots of micro-fusion with indexed addressing modes can lead to slowdowns vs. the same instructions with fewer register operands: one-register addressing modes and immediates. We don't know the cause yet, but I suspect some kind of register-read limit, maybe related to reading lots of cold registers from the PRF.


Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if they're later un-laminated.

# store
mov        [rax], edi  SnB/HSW/SKL: 1 fused-domain, 2 unfused.  The store-address uop can run on port7.
mov    [rax+rsi], edi  SnB: unlaminated.  HSW/SKL: stays micro-fused.  (The store-address can't use port7, though).
mov [buf +rax*4], edi  SnB: unlaminated.  HSW/SKL: stays micro-fused.

# normal ALU stuff
add    edx, [rsp+rsi]  SnB: unlaminated.  HSW/SKL: stays micro-fused.  
# I assume the majority of traditional/normal ALU insns are like add

Three-input instructions that HSW/SKL may have to un-laminate

vfmadd213ps xmm0,xmm0,[rel buf] HSW/SKL: stays micro-fused: 1 fused, 2 unfused.
vfmadd213ps xmm0,xmm0,[rdi]     HSW/SKL: stays micro-fused
vfmadd213ps xmm0,xmm0,[0+rdi*4] HSW/SKL: un-laminated: 2 uops in fused & unfused-domains.
     (So indexed addressing mode is still the condition for HSW/SKL, same as documented by Intel for SnB)

# no idea why this one-source BMI2 instruction is unlaminated
# It's different from ADD in that its destination is write-only (and it uses a VEX encoding)
blsi   edi, [rdi]       HSW/SKL: 1 fused-domain, 2 unfused.
blsi   edi, [rdi+rsi]   HSW/SKL: 2 fused & unfused-domain.


adc         eax, [rdi] same as cmov r, [rdi]
cmove       ebx, [rdi]   Stays micro-fused.  (SnB?)/HSW: 2 fused-domain, 3 unfused domain.  
                         SKL: 1 fused-domain, 2 unfused.

# I haven't confirmed that this micro-fuses in the decoders, but I'm assuming it does since a one-register addressing mode does.

adc   eax, [rdi+rsi] same as cmov r, [rdi+rsi]
cmove ebx, [rdi+rax]  SnB: untested, probably 3 fused&unfused-domain.
                      HSW: un-laminated to 3 fused&unfused-domain.  
                      SKL: un-laminated to 2 fused&unfused-domain.

I assume that Broadwell behaves like Skylake for adc/cmov.

It's strange that HSW un-laminates memory-source ADC and CMOV. Maybe Intel didn't get around to changing that from SnB before they hit the deadline for shipping Haswell.

Agner's insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. The cycle counts I'm measuring match up with the fused-domain uop issue count, for a 4 uops / clock issue bottleneck. Hopefully he'll double-check that and correct the tables.

Memory-dest integer ALU:

add        [rdi], eax  SnB: untested (Agner says 2 fused-domain, 4 unfused-domain: load + ALU + store-address + store-data)
                       HSW/SKL: 2 fused-domain, 4 unfused.
add    [rdi+rsi], eax  SnB: untested, probably 4 fused & unfused-domain
                       HSW/SKL: 3 fused-domain, 4 unfused.  (I don't know which uop stays fused).
                  HSW: About 0.95 cycles extra store-forwarding latency vs. [rdi] for the same address used repeatedly.  (6.98c per iter, up from 6.04c for [rdi])
                  SKL: 0.02c extra latency (5.45c per iter, up from 5.43c for [rdi]), again in a tiny loop with dec ecx/jnz


adc     [rdi], eax      SnB: untested
                        HSW: 4 fused-domain, 6 unfused-domain.  (same-address throughput 7.23c with dec, 7.19c with sub ecx,1)
                        SKL: 4 fused-domain, 6 unfused-domain.  (same-address throughput ~5.25c with dec, 5.28c with sub)
adc     [rdi+rsi], eax  SnB: untested
                        HSW: 5 fused-domain, 6 unfused-domain.  (same-address throughput = 7.03c)
                        SKL: 5 fused-domain, 6 unfused-domain.  (same-address throughput = ~5.4c with sub ecx,1 for the loop branch, or 5.23c with dec ecx for the loop branch.)

Yes, that's right, adc [rdi],eax / dec ecx / jnz runs faster than the same loop with add instead of adc on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated rewrites of the same address (store-forwarding latency higher than expected). See also this post about repeated store/reload to the same address being slower than expected on SKL.

Memory-destination adc is so many uops because Intel P6-family (and apparently SnB-family) can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an extra uop to work around the problem-case where the load and add complete, and then the store faults, but the insn can't just be restarted because CF has already been updated. Interesting series of comments from Andy Glew (@krazyglew).

Presumably fusion in the decoders and un-lamination later saves us from needing microcode ROM to produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg.


Why SnB-family un-laminates:

Sandybridge simplified the internal uop format to save power and transistors (along with making the major change to using a physical register file, instead of keeping input / output data in the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add and adc are taking full advantage of that, or if Intel had to get Haswell out the door with some instructions still un-laminating as they did on SnB.

Nehalem and earlier have a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track micro-fused uops with 3 input registers (the non-memory register operand, base, and index).


So indexed stores and ALU+load instructions can still decode efficiently (not having to be the first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages of micro-fusion are essentially gone for tuning tight loops. "Un-lamination" happens before the 4-fused-domain-uops-per-cycle issue/retire width of the out-of-order core. The fused-domain performance counters (uops_issued / uops_retired.retire_slots) count fused-domain uops after un-lamination.

Intel's description of the renamer (Section 2.3.3.1: Renamer) implies that it's the issue/rename stage which actually does the un-lamination, so uops destined for un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue / loop-buffer (aka the IDQ).

TODO: test this. Make a loop that should just barely fit in the loop buffer. Change something so one of the uops will be un-laminated before issuing, and see if it still runs from the loop buffer (LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf counters to track where uops come from, so this should be easy.
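
The counters in question exist as symbolic perf events on SnB-family, so the measurement would look something like this (a sketch of the experiment, assuming a test binary built for it):

# lsd.uops, idq.dsb_uops, and idq.mite_uops attribute issued uops to the
# loop buffer, the uop cache, and the legacy decoders, respectively.
perf stat -e lsd.uops,idq.dsb_uops,idq.mite_uops,uops_issued.any ./uop-test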

Harder TODO: if un-lamination happens between reading from the uop cache and adding to the IDQ, test whether it can ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage, can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing the first 4.)


(See a previous version of this answer for some guesses based on tuning some LUT code, with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop.)

Experimental testing on SnB

The HSW/SKL numbers were measured on an i5-4210U and an i7-6700k. Both had HT enabled (but the system idle so the thread had the whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL and Linux 4.8 on HSW, using ocperf.py. (The HSW laptop NFS-mounted my SKL desktop's /home.)

The SnB numbers were measured as described below, on an i5-2500k which is no longer working.

Confirmed by testing with performance counters for uops and cycles.

I found a table of PMU events for Intel Sandybridge, for use with Linux's perf command. (Standard perf unfortunately doesn't have symbolic names for most hardware-specific PMU events, like uops.) I made use of it for a recent answer.

ocperf.py provides symbolic names for these uarch-specific PMU events, so you don't have to look up tables. Also, the same symbolic name works across multiple uarches. I wasn't aware of it when I first wrote this answer.
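
For example, the events used below could be gathered with something like this (the exact command line is my reconstruction; the original SnB runs used raw event codes as shown further down):

ocperf.py stat -e task-clock,cycles,instructions,uops_dispatched.thread,uops_issued.any,uops_retired.retire_slots,uops_retired.all ./uop-test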

To test for uop micro-fusion, I constructed a test program that is bottlenecked on the 4-uops-per-cycle fused-domain limit of Intel CPUs. To avoid any execution-port contention, many of these uops are nops, which still sit in the uop cache and go through the pipeline the same as any other uop, except they don't get dispatched to an execution port. (An xor same,same, or an eliminated mov, would behave the same.)

Test program: yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test

GLOBAL _start
_start:
    xor eax, eax
    xor ebx, ebx
    xor edx, edx
    xor edi, edi
    lea rsi, [rel mydata]   ; load pointer
    mov ecx, 10000000
    cmp dword [rsp], 2      ; argc >= 2
    jge .loop_2reg

ALIGN 32
.loop_1reg:
    or eax, [rsi + 0]
    or ebx, [rsi + 4]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_1reg
;   xchg r8, r9     ; no effect on flags; decided to use NOPs instead

    jmp .out

ALIGN 32
.loop_2reg:
    or eax, [rsi + 0 + rdi]
    or ebx, [rsi + 4 + rdi]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_2reg

.out:
    xor edi, edi
    mov eax, 231    ;  exit(0)
    syscall

SECTION .rodata
mydata:
db 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff

I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer. See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for more investigation on HSW/SKL. SnB may be worse than HSW in this case, but I'm not sure, and no longer have working SnB hardware.

I wanted to keep macro-fusion (compare-and-branch) out of the picture, so I used nops between the dec and the branch. I used 4 nops, so with micro-fusion the loop would be 8 fused-domain uops, filling the pipeline at 2 cycles per iteration.

In the other version of the loop, using 2-register addressing modes that don't micro-fuse, the loop will be 10 fused-domain uops and run in 3 cycles.

Results from my 3.3GHz Intel Sandybridge (i5 2500k). I didn't do anything to get the cpufreq governor to ramp up clock speed before testing, because cycles are cycles when you aren't interacting with memory. I've added annotations for the performance counter events that I had to enter in hex.

testing the 1-reg addressing mode: no cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test

Performance counter stats for './uop-test':

     11.489620      task-clock (msec)         #    0.961 CPUs utilized
    20,288,530      cycles                    #    1.766 GHz
    80,082,993      instructions              #    3.95  insns per cycle
                                              #    0.00  stalled cycles per insn
    60,190,182      r1b1  ; UOPS_DISPATCHED: (unfused-domain.  1->umask 02 -> uops sent to execution ports from this thread)
    80,203,853      r10e  ; UOPS_ISSUED: fused-domain
    80,118,315      r2c2  ; UOPS_RETIRED: retirement slots used (fused-domain)
   100,136,097      r1c2  ; UOPS_RETIRED: ALL (unfused-domain)
       220,440      stalled-cycles-frontend   #    1.09% frontend cycles idle
       193,887      stalled-cycles-backend    #    0.96% backend  cycles idle

   0.011949917 seconds time elapsed

testing the 2-reg addressing mode: with a cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test x

 Performance counter stats for './uop-test x':

         18.756134      task-clock (msec)         #    0.981 CPUs utilized
        30,377,306      cycles                    #    1.620 GHz
        80,105,553      instructions              #    2.64  insns per cycle
                                                  #    0.01  stalled cycles per insn
        60,218,693      r1b1  ; UOPS_DISPATCHED: (unfused-domain.  1->umask 02 -> uops sent to execution ports from this thread)
       100,224,654      r10e  ; UOPS_ISSUED: fused-domain
       100,148,591      r2c2  ; UOPS_RETIRED: retirement slots used (fused-domain)
       100,172,151      r1c2  ; UOPS_RETIRED: ALL (unfused-domain)
           307,712      stalled-cycles-frontend   #    1.01% frontend cycles idle
         1,100,168      stalled-cycles-backend    #    3.62% backend  cycles idle

       0.019114911 seconds time elapsed

So, both versions ran 80M instructions, and dispatched 60M uops to execution ports. (An or with a memory source dispatches an ALU uop for the or and a load-port uop for the load, regardless of whether it was micro-fused or not in the rest of the pipeline; nop doesn't dispatch to an execution port at all.) Similarly, both versions retire 100M unfused-domain uops, because the 40M nops count here.

The difference is in the counters for the fused-domain.

  1. The 1-register address version only issues and retires 80M fused-domain uops. This is the same as the number of instructions. Each insn turns into one fused-domain uop.
  2. The 2-register address version issues 100M fused-domain uops. This is the same as the number of unfused-domain uops, indicating that no micro-fusion happened.

I suspect that you'd only see a difference between UOPS_ISSUED and UOPS_RETIRED(retirement slots used) if branch mispredicts led to uops being cancelled after issue, but before retirement.

And finally, the performance impact is real. The non-fused version took 1.5x as many clock cycles. This exaggerates the performance difference compared to most real cases. The loop has to run in a whole number of cycles (on Sandybridge where the LSD is less sophisticated), and the 2 extra uops push it from 2 to 3. Often, an extra 2 fused-domain uops will make less difference. And potentially no difference, if the code is bottlenecked by something other than 4 fused-domain uops per cycle.

Still, code that makes a lot of memory references in a loop might be faster if implemented with a moderate amount of unrolling and incrementing multiple pointers used with simple [base + immediate offset] addressing, instead of [base + index] addressing modes. (See the sketch below.)
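
For instance (a sketch; register choices and the vaddps bodies are mine, with uop counts implied by the rules above):

; indexed version: each 3-operand VEX load+ALU un-laminates
.indexed:
    vaddps ymm0, ymm0, [rsi+rax]
    vaddps ymm1, ymm1, [rdx+rax]
    add    rax, 32
    dec    ecx
    jg     .indexed              ; 6 fused-domain uops per iteration

; pointer-increment version: [base] stays micro-fused even for VEX
.pointers:
    vaddps ymm0, ymm0, [rsi]
    vaddps ymm1, ymm1, [rdx]
    add    rsi, 32
    add    rdx, 32
    dec    ecx
    jg     .pointers             ; 5 fused-domain uops per iteration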

Further stuff


RIP-relative with an immediate can't micro-fuse. Agner Fog's testing shows that this is the case even in the decoders / uop-cache, so they never fuse in the first place (rather than being un-laminated).

IACA gets this wrong, and claims that both of these micro-fuse:

cmp dword  [abs mydata], 0x1b   ; fused counters != unfused counters (micro-fusion happened, and wasn't un-laminated).  Uses 2 entries in the uop-cache, according to Agner Fog's testing
cmp dword  [rel mydata], 0x1b   ; fused counters ~= unfused counters (micro-fusion didn't happen)

(There are some more limits for micro+macro fusion to both happen for a cmp/jcc. TODO: write that up for testing a memory location.)

RIP-rel does micro-fuse (and stay fused) when there's no immediate, e.g.:

or  eax, dword  [rel mydata]    ; fused counters != unfused counters, i.e. micro-fusion happens

Micro-fusion doesn't increase the latency of an instruction: the load can dispatch and execute before the other input is ready.

ALIGN 32
.dep_fuse:
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    dec ecx
    jg .dep_fuse

This loop runs at 5 cycles per iteration, because of the eax dep chain. No faster than a sequence of or eax, [rsi + 0 + rdi], or mov ebx, [rsi + 0 + rdi] / or eax, ebx. (The unfused and the mov versions both run the same number of uops.) Scheduling / dep checking happens in the unfused-domain. Newly issued uops go into the scheduler (aka Reservation Station (RS)) as well as the ROB. They leave the scheduler after dispatching (aka being sent to an execution unit), but stay in the ROB until retirement. So the out-of-order window for hiding load latency is at least the scheduler size (54 unfused-domain uops in Sandybridge, 60 in Haswell, 97 in Skylake).

Micro-fusion doesn't have a shortcut for the base and offset being the same register. A loop with or eax, [mydata + rdi + 4*rdi] (where rdi is zeroed) runs as many uops and cycles as the loop with or eax, [rsi+rdi]. This addressing mode could be used for iterating over an array of odd-sized structs starting at a fixed address, but it's rarely if ever used, so it's no surprise that Intel didn't spend transistors on letting this special case of 2-register modes micro-fuse. (And Intel documents it as "indexed addressing modes" anyway, where a register and scale factor are needed.)


Macro-fusion of a cmp/jcc or dec/jcc creates a uop that stays as a single uop even in the unfused-domain. dec / nop / jge can still run in a single cycle but is three uops instead of one.
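
Illustrated (behavior exactly as described above):

cmp ecx, edx
jne .target        ; macro-fused: 1 uop, even in the unfused domain

dec ecx
nop                ; the nop defeats macro-fusion
jge .target        ; dec/nop/jge: 3 uops, but can still run in one cycle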

Opt answered 24/6, 2015 at 13:17 Comment(39)
Too bad consumer Skylake processors won't have AVX512. AVX-512 is a lot less interesting now.Moonlit
yeah, my sentiments exactly. I'm hoping Skylake Xeons will come out around the same time as desktop. A Haswell "workstation" with a xeon CPU doesn't cost much more than quality desktop, and you can use ECC RAM without limiting yourself to an i3.Opt
I just noticed that you drastically changed the text of your answer since Agner's latest answer. I am not a big fan of such drastic changes. I normally prefer updates. Did you make a new discovery? It seems your answer disagrees with Agner's. I don't like this. I was tempted to remove the accepted answer and leave it unaccepted since I respect both your answers and I don't know enough to say which is correct. Does your answer and Agner's disagree?Moonlit
@Zboson: Yes, I updated after finding official confirmation in Intel's optimization manual that resolved the discrepancy between my testing and Agner's testing. His testing method apparently measures uops in uop-cache, where indexed addressing modes are micro-fused. My testing measures fused-domain uops in the issue stage, after they've been "un-laminated". Indexed addressing modes micro-fuse in the decoders and uop-cache. So we're both technically right. I should send him a mail; I guess he didn't see my comment. His guide should def. mention this.Opt
Yes, send him an email or maybe write on his blog agner.org/optimize/blogMoonlit
It's not clear to me from your answer what IACA is right and wrong about. Can you explain it in one or two sentences? BTW, your answer on deoptimizing was in the top 30 on news.ycombinator.com yesterday. Here is the discussion news.ycombinator.com/item?id=11749756Moonlit
Oh, I see you already commented to that thread. You're active on news.ycombinator.com as well?Moonlit
@Zboson: That IACA section was badly worded, thanks for pointing that out. Fixed. re: ycombinator: No, don't even follow ycombinator at all. I think you or someone else pointed me to that thread yesterday, so I registered to leave a couple replies.Opt
So will these indexed mode instructions still un-laminate on SKL? There were some changes there (building on related changes in BDW) that allow the RS to handle ops with 3-input dependencies. For eg, CMOV now generates only one uop, whereas before it was 2. Similarly for a few other 3-input instructions. So perhaps the un-laminating has been eliminated now.Gimble
@BeeOnRope: IIRC, Intel's optimization manual has a lot of specific stuff to say about SKL in the same section as that un-laminating on SnB-family, and that isn't one of them, IIRC. I think it's not just the extra register dependency, but also bits that say which addressing mode it is, and the scale-factor bits. Even [disp32 + index] one-register addressing modes are un-laminated, so it's not just a matter of tracking an extra dependency.Opt
@BeeOnRope: Also, with micro-fusion, they're always separate in the unfused-domain RS (scheduler) where uops wait for their inputs and port to be ready. They're only fused in the ROB where they wait to retire. Converting adc and cmov to be one uop affected even the unfused-domain. Adding extra bits to uop format in the ROB would probably have required a lot of redesign time. (And redesign in other places, too, like no longer un-laminating in the issue stage). Still, it's something we can hope Intel does eventually.Opt
@PeterCordes - right, the changes to allow cmov and friends to be single-uop (across all domains) were larger than just the 3-arg stuff, but it seems like one part of that change may have been to increase the size of the entry in the ROB to accommodate 3-arg uops: since cmov and friends allow all the complex instruction modes, perhaps they wouldn't fit in the ROB otherwise. That could have the side effect of also allowing stuff like complex modes to avoid unlamination. Anyway, I'm about to test it.Gimble
BTW - this note: "I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops." - is perhaps the most interesting finding above, even though it isn't directly related to the above! Have you seen any confirmation elsewhere? If true, it should be a prominent item in optimization guides, reflected in IACA, etc. You seem to indicate that you think it is a limitation of the loop buffer only, but it would be very odd if the loop buffer had worse throughput than the uop buffer or legacy decode modes. Worth an investigation...Gimble
@BeeOnRope: A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger loops (that don't fit in the loop buffer) to see if it's possible for the first instruction from the next iteration to issue in the same group as the taken-branch to it, but I assume not. I think the point is that the loop buffer really can provide a guaranteed 4u/c, while the other uop sources don't even without branches. (uop cache-line boundaries can limit the front-end, since at least SnB is only capable of reading from 1 ucache line per clock, and only reads 4 of the up-to-6 uops) SKL is different, IIRCOpt
The partial-group-at-the-end effect is a bigger factor for tiny loops, because 5 uops per 2 cycles is way worse than 97 uops per 25 cycles (assuming perfect uop-cache throughput). This may be why it's worth specifically mentioning.Opt
Yeah, it definitely adds another factor to consider in decisions like "unroll vs not unroll" (in favor of unroll) and also in instruction selection (e.g., you might prefer an 8-uop loop over a 7-uop loop because it is better in some other criteria such as instruction size, HT friendliness, cache friendliness, power use, AMD performance, whatever).Gimble
It seems like Skylake has fixed the unlamination issue. My test of your code shows the exact same results for both the 1-arg and 2-arg versions. I'll put my results as another answer, but feel free to edit it into your answer too.Gimble
@PeterCordes - your finding about loop bodies is interesting enough to warrant a separate question, which I posted here. I'm investigating the answer for Skylake, which is the only hardware I have available. Well I think I have some SB too (~2013 MacBook Air?) but it's running OSX.Gimble
BTW, I posted over on Agner's blog about this issue, linking to this question, with the idea that this would be great to cover in his manual. I think based on the investigation here the issue is pretty much fully understood - but having it in what is kind of the canonical (or at least best) source would be ideal.Gimble
@PeterCordes - I posted the results of my investigation on the multiple-of-4 issue, covering not only the LSD but the legacy decoder and uop cache too. The summary on Skylake is that indeed the LSD has various restrictions, but it is far from as simple as "must be a multiple of 4". For example, a 7 uop list required 2 cycles, as you'd expect from the simple 4N interpretation, but a 9 uop loop required 2.3 cycles (not the 3 you'd expect if it was rounded up to 12 uops). More mysteries abound in the LSD. The DSB and legacy decode were simpler.Gimble
Working on an update to this: HSW/SKL can only keep a uop micro-fused if it has 2 operands and treats the dst register as read-modify-write. e.g. paddb xmm0, [rdi+rbx] but not vpaddb xmm0, xmm0, [rdi+rbx] or pabsb xmm0, [rdi+rdx].Opt
@Peter - Huh so that substantially rules out VEX encoded stuff, which is generally three argument? Kind of unfortunate because before knowing that VEX seemed like a clear win.Gimble
@BeeOnRope: Yeah, it's an unfortunate downside to VEX :(. I forgot to clarify that it's still only a problem for indexed addressing modes, and only for ALU+load. VEX stores stay micro-fused. You can usually just unroll to amortize the cost of incrementing more pointers and use vpaddb xmm0, xmm0, [rdi+32], which stays micro-fused even on SnB/IvB. But even with indexed addressing modes, VEX non-destructive 3-operand is still usually a win for front-end throughput because of avoiding so many MOVDQA reg,reg uops.Opt
@PeterCordes - I'm not sure if you're still working on a re-write, but I added the new info you discovered about HSW/SKL and 2-operand RMW to "the rules" because it's too important to be lost down here in a comment, I think. It really expands the cases where fusion doesn't happen: not fusing for VEX-encoded memory-source ops is pretty important to keep in mind.Gimble
@BeeOnRope: thanks, I'd been meaning to make a minor edit with that before I get back to the big edit. The current state of the text I'm working on includes a complete explanation of what micro-fusion is, because Agner's description is incomplete and doesn't describe where un-lamination happens (or mention it at all). It was getting so big that it's a bit daunting to get back to. It's hard to limit it to just describing the behaviour without spending too much time on optimization advice like when you'd want to use indexed addressing modes or not.Opt
The condition about "... and treats the dst register as read-modify-write" is a bit subtle. For example popcnt and bsf can both micro-fuse with indexed addressing modes, even though the former has a write-only destination, and latter is mostly write-only. I guess the "false dependency" saves them from unlamination? On the other hand, tzcnt doesn't micro-fuse (and it has no false dependency) so that's one way bsf is better than tzcnt!Gimble
BTW, I tested most of the cases you mention on CannonLake and everything is the same as Skylake, except that now popcnt does unlaminate, which lines up with the fact that the false dep for popcnt is fixed in CNL.Gimble
@PeterCordes The raw event that you used r1c2 is documented as Counts the number of micro-ops retired, (macro-fused=1, micro-fused=2, others=1; maximum count of 8 per cycle). only for CPUs with DisplayModel_DisplayFamily: 06_1AH, 06_1EH, 06_1FH, and 06_2EH. On my WhL it is 06_8e. Can the event be reliably used on SkL/KbL/WhL uarchs? In the event table list of my architecture (also SkL and KbL) it appears as UOPS_RETIRED.TOTAL_CYCLES with the signature cpu/event=0xc2,umask=0x1,cmask=0x10,inv/ and there is no UOPS_RETIRED.ANY.Including
@PeterCordes I think it is inappropriate to use r1c2, r2c2 at least on WhL with DisplayFamily_DisplayModel 06_8e as it is not documented in the related event table and Intel Manual Vol.3/19 clearly states: All performance event encodings not documented in the appropriate tables for the given processor are considered reserved, and their use will result in undefined counter updates with associated overflow actions.Including
@St.Antario: IDK, the r1/2c2 data in this answer was collected on SnB (i5-2500k). I generally just use named events provided by perf, or previously ocperf.py, without getting into the details of event/umask programming. On my SKL, the event you want is uops_retired.retire_slots, like I used in the SKL section here. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? where I used uops_issued_any:u and uops_executed.thread:u. (Issue and retire fused-domain counts are the same if there are no mispredicts or other rollbacks.)Opt
@PeterCordes There are no issues with the uops_issued.any/uops_executed.thread counters. Both are present in the perf list. UOPS_RETIRED.RETIRE_SLOTS was also present and corresponds to the r2c2 raw counter on SkL/KbL/CfL. It was documented as Counts the number of retirement slots used each cycle. If the retirement slots used are accounted with respect to micro-fusion then this is the counter I'm looking for. But the documentation is not that detailed anyway...Including
@St.Antario: Yeah, I had to test it to see that it counted 1 for an instruction that stayed micro-fused, and 2 for instructions that un-laminated or couldn't micro-fuse in the first place. But we do know that retirement happens in the fused domain so the "retire slots" name makes sense.Opt
I was wondering where unlamination really occurs and whether instructions that need to be unlaminated can bypass the IDQ i.e. it goes to the allocator whilst it is cached in the queue. In which case it would be the allocator that unlaminates. Which would increase the LSD/IDQ entries available (but not ROB). Alternatively, it appears in the IDQ and then allocator unlaminates instantly (its never not in an unlaminated state in the queue). If unlamination were done while in the IDQ (I.e. enters laminated and then unlaminates) it would mean a 1 cycle delay and wouldn't be able to bypass the IDQBechance
@LewisKelsey: I think I tried to test LSD capacity on HSW at some point but I don't remember what I found; hopefully I wrote down the results somewhere. But IIRC, there's some evidence that un-lamination can take an extra cycle in the front-end sometimes for the extra uop, like a loop that needs un-lamination can be slower than a loop which isn't fused in the first place. Perhaps on one of Andreas Abel's questions? That would indicate that un-lamination doesn't happen until after the IDQ. But it's been a while since I looked, and I forget how well we ruled out alternative explanations.Opt
Or maybe I'm just remembering those ideas that I already mentioned as TODO items in this answer but haven't actually done. I don't have a machine with an LSD I can test on, unless I somehow boot my SKL with old microcode.Opt
@PeterCordes re: "with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop": any chance you have a link with more information on this? (i.e., I didn't see any references in the earlier version).Gal
@Noah: Version 8 looks like the last version with those 2 paragraphs: stackoverflow.com/revisions/31027695/8 with a link to github.com/pcordes/par2-asm-experiments.git. (That's probably better done with pshufb for register LUTs, because IIRC the problem is separable into 4-bit chunks. That was one of the first asm-optimization things I played with after getting into Agner Fog's manuals; before that I knew about asm in general, and some about CPU architecture, but hadn't seriously put it to use for optimization. So I didn't realize how good pshufb was.)Opt
@PeterCordes and yet you were already providing fantastic answers like this one! But nice to see you are human too xDGal
@PeterCordes if you have the words, renaming bubbles from unlamination might be a useful link in the further reading / performance impact.Gal

Note: Since I wrote this answer, Peter tested Haswell and Skylake as well and integrated the results into the accepted answer above (in particular, most of the improvements I attribute to Skylake below seem to have actually appeared in Haswell). You should see that answer for the rundown of behavior across CPUs and this answer (although not wrong) is mostly of historical interest.

My testing indicates that on Skylake at least1, the processor fully fuses even complex addressing modes, unlike Sandybridge.

That is, the 1-arg and 2-arg versions of the code posted above by Peter run in the same number of cycles, with the same number of uops dispatched and retired.

My results:

Performance counter stats for ./uop-test:

     23.718772      task-clock (msec)         #    0.973 CPUs utilized          
    20,642,233      cycles                    #    0.870 GHz                    
    80,111,957      instructions              #    3.88  insns per cycle        
    60,253,831      uops_executed_thread      # 2540.344 M/sec                  
    80,295,685      uops_issued_any           # 3385.322 M/sec                  
    80,176,940      uops_retired_retire_slots # 3380.316 M/sec                  

   0.024376698 seconds time elapsed

Performance counter stats for ./uop-test x:

     13.532440      task-clock (msec)         #    0.967 CPUs utilized          
    21,592,044      cycles                    #    1.596 GHz                    
    80,073,676      instructions              #    3.71  insns per cycle        
    60,144,749      uops_executed_thread      # 4444.487 M/sec                  
    80,162,360      uops_issued_any           # 5923.718 M/sec                  
    80,104,978      uops_retired_retire_slots # 5919.478 M/sec                  

   0.013997088 seconds time elapsed

Performance counter stats for ./uop-test x x:

     16.672198      task-clock (msec)         #    0.981 CPUs utilized          
    27,056,453      cycles                    #    1.623 GHz                    
    80,083,140      instructions              #    2.96  insns per cycle        
    60,164,049      uops_executed_thread      # 3608.645 M/sec                  
   100,187,390      uops_issued_any           # 6009.249 M/sec                  
   100,118,409      uops_retired_retire_slots # 6005.112 M/sec                  

   0.016997874 seconds time elapsed

I didn't find any UOPS_RETIRED_ANY event on Skylake, only the "retired slots" one, which is apparently fused-domain.

The final test (uop-test x x) is a variant Peter suggested, which uses a RIP-relative cmp with an immediate, which is known not to micro-fuse:

.loop_riprel:
    cmp dword [rel mydata], 1
    cmp dword [rel mydata], 2
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_riprel

The results show that the extra 2 uops per loop iteration are picked up by the uops-issued and uops-retired counters (hence the test can differentiate between fusion occurring and not).

More tests on other architectures are welcome! You can find the code (copied from Peter above) on GitHub.


[1] ... and perhaps some other architectures in-between Skylake and Sandybridge, since Peter only tested SB and I only tested SKL.

Gimble answered 2/9, 2016 at 5:31 Comment(10)
Did you test any cases that are known not to micro-fuse in the first place? e.g. RIP-relative with immediate? (either read-modify-write ALU, mov store, or cmp/test mem, imm8). It would be very good to confirm that your perf-counter results do show the difference between micro-fusion and no micro-fusion.Opt
I'm not following this part: (either read-modify-write ALU, mov store, or cmp/test mem, imm8). Is RIP-relative + immediate enough to avoid micro-fusion or do I need to use it in one of those instructions?Gimble
You need a RIP-relative and an immediate in the same insn. There are three different cases: store-only (mov dword [rel symbol], 1234), load-only (cmp dword [rel symbol], 1), and read-modify-write (or dword [rel symbol], 1). There are also some instructions that apparently never micro-fuse, according to Agner's tables. e.g. shlx r,m,i is 2 uops in fused and unfused domains, but only 1 uop with a register src. Similarly, pblendw is like this. pinsrb/w/d/q is either 2p5 (reg src) or p5+p23 (mem src).Opt
BTW, Agner already documents cases that never micro-fuse in the first place, since his testing method detected them. I think he was testing via uop cache consumption, which is why he didn't detect unlamination.Opt
OK, I tested the cmp [sym], 1 variant and indeed it shows 2 more uops issued and retired per loop (i.e., the last two counters above), and an increase in cycles. Other counters unchanged.Gimble
Updated question with results and github repo.Gimble
@PeterCordes - missed your first question. Yeah I got the names from ocperf.py. The missing events seem consistent with other lists I've seen. I didn't follow the NOP comment - the existing "retired" counters all already seem to count the NOPs? Only the "uops issued" counter seems to ignore them (indeed, since they are eliminated prior to issue). Do NOPs take an entry in the ROB?Gimble
You're right, that was nonsense. It's been a while since I looked at my test code and numbers in detail. I guess NOPs take ROB entries. You have to be able to jmp to them, so they definitely need uop-cache entries. There doesn't seem to be any need for an interrupt to be able to happen between two NOPs, but x86 has lots of corner cases. (e.g. mov ss, reg disables interrupts until after the next instruction.) Since running NOPs isn't usually a performance bottleneck, presumably Intel just let them go through the pipe instead of totally hiding them.Opt
NOPs do issue and retire, but they don't dispatch to any execution unit. That's what we actually see in the perf results. They're not really "eliminated"; only in the sense that the work of a MOV or XOR-zeroing is eliminated, not the whole uop. So I guess they enter the ROB in an already-executed state, and never enter the RS (scheduler, aka Reservation Station)Opt
Updated my answer with test results from a Haswell laptop and my SKL desktop. HSW can micro-fuse indexed addressing modes the same way SKL can. IACA is wrong.Opt

Older Intel processors without a uop cache can do the fusion, so maybe this is a drawback of the uop cache. I don't have the time to test this right now, but I will add a test for uop fusion next time I update my test scripts. Have you tried with FMA instructions? They are the only instructions that allow 3 input dependencies in an unfused uop.

Cossack answered 12/7, 2015 at 5:49 Comment(2)
I haven't. I don't have a Haswell CPU. >.< But that's an excellent point, fusion rules might be different.Opt
@PeterCordes, I originally discovered this from a question using FMA. See the part where I discuss Stephen Canon's comment. He suggested "using the store address as the offset for the load operands," which allows the store to use port 7. However, this does not fuse so it's no better. The only solution which allowed me to have four fused micro-ops (6 total) was Evgeny Kluev's suggestion using a static array and one-register mode. I asked this question because of that question.Moonlit

I have now reviewed test results for Intel Sandy Bridge, Ivy Bridge, Haswell and Broadwell. I have not had access to test on a Skylake yet. The results are:

  • Instructions with two-register addressing and three input dependencies fuse all right. They take only one entry in the micro-operation cache as long as they contain no more than 32 bits of data (or 2 * 16 bits).
  • It is possible to make instructions with four input dependencies, using fused multiply-and-add instructions on Haswell and Broadwell. These instructions still fuse into a single micro-op and take only one entry in the micro-op cache (illustrated after this list).
  • Instructions with more than 32 bits of data, for example a 32-bit address and 8-bit immediate data, can still fuse, but use two entries in the micro-operation cache (unless the 32 bits can be compressed into a 16-bit signed integer)
  • Instructions with rip-relative addressing and an immediate constant do not fuse, even if both the offset and the immediate constant are very small.
  • All the results are identical on the four machines tested.
  • The tests were performed with my own test programs using the performance monitoring counters on loops that were sufficiently small to fit into the micro-op cache.
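
An instruction with four input dependencies, as in the second point above, would look like this (my illustration, not from Agner's test code; per the accepted answer, it stays fused in the decoders/uop cache on HSW but is un-laminated at issue):

vfmadd231ps ymm0, ymm1, [rsi+rax*4]   ; 4 inputs: ymm0 (the 231 form reads its destination), ymm1, rsi, rax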

Your results may be due to other factors. I have not tried to use the IACA.

Cossack answered 1/12, 2015 at 14:54 Comment(3)
I was using small ~8 uop loops on SnB, and looking at the perf counters for fused and unfused domain uops. Can you see anything wrong with my test code (posted in my answer)? I was using instructions like or eax, [rsi + 4 + rdi], which only has 32bits of data (the offset). Was I looking at the wrong perf counter or something? The change in observed behaviour (cycles to run the loop) matches up with fusion not happening -> loop takes more cycles per iteration because of the 4-wide pipe. And fused-domain matches unfused-domain counts.Opt
I was testing fused-domain uops against the 4-wide limit of the pipeline for issuing / retiring 4 fused-domain uops per clock. Is it possible that the uop cache can fuse better than the rest of the pipeline? My test was with tiny loops, which fit in the loop buffer, so the uop cache shouldn't have been directly involved.Opt
Intel's optimization manual confirms that micro-fusion happens in the decoders, but indexed addressing modes are "un-laminated" as they issue. Others stay fused. So micro-fusion doesn't help when the 4-wide issue/retire throughput is the bottleneck, nor does it help with fitting more insns into the ROB. See my updated answer.Opt
