Micro fusion and addressing modes
I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).

The following instruction using [base+index] addressing

addps xmm1, xmmword ptr [rsi+rax*1]

does not micro-fuse according to IACA. However, if I use [base+offset] like this

addps xmm1, xmmword ptr [rsi]

IACA reports that it does fuse.

Section 2-11 of the Intel optimization reference manual gives the following as an example "of micro-fused micro-ops that can be handled by all decoders"

FADD DOUBLE PTR [RDI + RSI*8]

and Agner Fog's optimization assembly manual also gives examples of micro-op fusion using [base+index] addressing. See, for example, Section 12.2 "Same example on Core2". So what's the correct answer?

Moonlit answered 25/9, 2014 at 19:33 Comment(16)
Downvoter please explain yourself. Not all of us have time to test everything through experiment.Moonlit
@IwillnotexistIdonotexist, I am trying to write tests to check this. Currently I have a case where IACA says the fused version has a block throughput of 2.0 and the non-fused version 6.0 but they both take the same time in practice. I am leaning towards the side that IACA has a bug. But if you find something please let me know.Moonlit
@IwillnotexistIdonotexist, did you get a chance to look into this? I can't give out 500 point bounties every day :-/Moonlit
I genuinely don't know; I've been quite stumped on this problem the past few days although somebody dropped this useful Haswell diagram below your older question's answer. That fills my sails slightly - Micro/macrofusion happens at decode time and the ROB can't assist.Crofoot
@IwillnotexistIdonotexist, that's a cool diagram! Thanks! Maybe I should just post a message on IACA forums about this.Moonlit
I'm grasping at straws - the section you quote out of the Intel optimization manual is under "Sandy Bridge". Did you try running IACA with the flag -arch SNB for the example instructions, and addps xmm1, xmmword ptr [rsi+rax*1]?Crofoot
For kicks I tried having IACA analyze the examples that Intel alleges will microfuse, but it turns out IACA claims fadd st0, qword ptr [rdi+rsi*8] does not microfuse, whether alone or unrolled 20 times. Don't know what to make of this. EDIT: That goes for all architectures: NHM, WSM, SNB, IVB and HSW.Crofoot
For that matter, ret also is claimed to microfuse but doesn't according to IACA, whereas jmp [rdi+200] does indeed microfuse.Crofoot
And shockingly an instruction claimed not to microfuse (cmp dword ptr [rip-0x43], 0x1b) does microfuse according to IACA on both SNB and HSW! I think there's something seriously wrong in either the manual or IACA, and our next step is to experimentally determine who is right (IACA or the manual).Crofoot
@IwillnotexistIdonotexist, yeah we need an experiment. My triad function is no good as it is now because on Core2-IB it needs 2 cycles with or without micro-op fusion anyway. On Haswell we already have an experiment to show that the fusion is simple on port 7 and if we fix the triad function to use port 7 it needs a compare which means port 6 takes two cycles. So some modification to the triad function is necessary or a new test altogether.Moonlit
@IwillnotexistIdonotexist: the Intel manuals were probably written before SnB. Sandybridge switched to a physical register file and made major under-the-hood changes to how uops are tracked. This came up in a discussion recently: stackoverflow.com/questions/31875464/…. Perf-counter experiments on SnB show that IACA is right. (except for rip-relative, glad you brought that up). I'm still waiting to hear if Skylake changed anything on this front.Opt
@PeterCordes, I tested Nehalem as well. It does not fuse either using two registers. This problem goes back further than SNB. Though IwillnotexistIdonotexist already noted that "That goes for all architectures: NHM, WSM, SNB, IVB and HSW". So I guess Intel's manual was written before Nehalem even.Moonlit
@Zboson: Are you sure about Nehalem? In Agner Fog's answer on this question, he says that older Intel CPUs without a uop cache can do the fusion. Maybe Intel changed the internal uop format for Nehalem's 28uop loop buffer? IACA does show it not fusing on NHM. You tested with actual perf counters, though?Opt
@PeterCordes, I only used IACA. I did not do any tests. Good point. I am assuming that IACA is right. Do you have proof otherwise (did I miss this in your answer)? My triad function on NHM - IVB needs at least two cycles due to the loads/stores on the same port so not-fusing is not an issue. It only matters since HSW (I resubmitted this comment due to some errors).Moonlit
@Zboson: For Nehalem, no. I only personally tested uops with perf counters on SnB. IACA is known to be unreliable, so I wouldn't trust it in the face of other evidence: Agner Fog's statement, and the fact that Sandybridge was when Intel made major changes to the internals (including the uop format IIRC what I read). SnB is generally considered the point at which P6 evolved into a new species of microarchitecture.Opt
Regarding the initial downvote, there appears to be a crop of militants on SO who summarily downvote any/everything that could be perceived as being related to micro-optimization. What they perhaps neglect to understand is that, despite the inherent value and importance of such study, it can also be fun.Tranquillize

In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an instruction with an immediate operand can't micro-fuse a RIP-relative addressing mode).

But some combinations of uop and addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop count after un-lamination is what matters.

Intel's optimization manual describes un-lamination for Sandybridge in Section E.2.2.4: Micro-op Queue and the Loop Stream Detector (LSD), but doesn't describe the changes for any later microarchitectures.

UPDATE: Intel's manual now has a detailed section describing un-lamination for Haswell (Section E.1.5, Unlamination), and a brief description for Sandybridge in Section E.2.2.4.


The rules, as best I can tell from experiments on SnB, HSW, and SKL:

  • SnB (and I assume also IvB): indexed addressing modes are always un-laminated, others stay micro-fused. IACA is (mostly?) correct.
  • HSW, SKL: These only keep an indexed ALU instruction micro-fused if it has 2 operands and treats the dst register as read-modify-write. Here "operands" includes flags, meaning that adc and cmov don't micro-fuse. Most VEX-encoded instructions also don't fuse, since they generally have three operands (so paddb xmm0, [rdi+rbx] fuses but vpaddb xmm0, xmm0, [rdi+rbx] doesn't). Finally, the occasional 2-operand instruction whose first operand is write-only, such as pabsb xmm0, [rax + rbx], also doesn't fuse. IACA is wrong, applying the SnB rules. (See the sketch after this list.)
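
These cases in NASM syntax (fusion behavior exactly as stated in the rule above, not re-measured here):

paddb  xmm0, [rdi+rbx]         ; stays micro-fused: 2 operands, xmm0 is read-modify-write
add    eax,  [rsp+rsi]         ; stays micro-fused for the same reason
vpaddb xmm0, xmm0, [rdi+rbx]   ; un-laminated: 3-operand VEX encoding
adc    eax,  [rdi+rsi]         ; un-laminated: the CF input counts as an extra operand
pabsb  xmm0, [rax+rbx]         ; un-laminated: the destination is write-only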

Related: simple (non-indexed) addressing modes are the only ones that the dedicated store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a single register, but src with dst+(initial_src-initial_dst). Then you only have to increment the dst register inside a loop.)
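
A minimal sketch of that trick (NASM syntax; src, dst, and COUNT are placeholder names of my own, and the loop body is just an example stream operation):

    lea    rsi, [rel src]
    lea    rdi, [rel dst]
    sub    rsi, rdi          ; rsi = initial_src - initial_dst, a loop-invariant offset
    mov    ecx, COUNT
ALIGN 32
.loop:
    movaps xmm0, [rdi+rsi]   ; the load reaches src via dst + offset
    addps  xmm0, xmm1
    movaps [rdi], xmm0       ; one-register store: the store-address uop can run on port 7
    add    rdi, 16           ; only the dst pointer needs incrementing
    dec    ecx
    jg     .loop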

Note that some instructions never micro-fuse at all (even in the decoders/uop-cache). e.g. shufps xmm, [mem], imm8, or vinsertf128 ymm, ymm, [mem], imm8, are always 2 uops on SnB through Skylake, even though their register-source versions are only 1 uop. This is typical for instructions with an imm8 control operand plus the usual dest/src1, src2 register/memory operands, but there are a few other cases. e.g. PSRLW/D/Q xmm,[mem] (vector shift count from a memory operand) doesn't micro-fuse, and neither does PMULLD.

See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL when you read lots of registers: Lots of micro-fusion with indexed addressing modes can lead to slowdowns vs. the same instructions with fewer register operands: one-register addressing modes and immediates. We don't know the cause yet, but I suspect some kind of register-read limit, maybe related to reading lots of cold registers from the PRF.


Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if they're later un-laminated.

# store
mov        [rax], edi  SnB/HSW/SKL: 1 fused-domain, 2 unfused.  The store-address uop can run on port7.
mov    [rax+rsi], edi  SnB: unlaminated.  HSW/SKL: stays micro-fused.  (The store-address can't use port7, though).
mov [buf +rax*4], edi  SnB: unlaminated.  HSW/SKL: stays micro-fused.

# normal ALU stuff
add    edx, [rsp+rsi]  SnB: unlaminated.  HSW/SKL: stays micro-fused.  
# I assume the majority of traditional/normal ALU insns are like add

Three-input instructions that HSW/SKL may have to un-laminate

vfmadd213ps xmm0,xmm0,[rel buf] HSW/SKL: stays micro-fused: 1 fused, 2 unfused.
vfmadd213ps xmm0,xmm0,[rdi]     HSW/SKL: stays micro-fused
vfmadd213ps xmm0,xmm0,[0+rdi*4] HSW/SKL: un-laminated: 2 uops in fused & unfused-domains.
     (So indexed addressing mode is still the condition for HSW/SKL, same as documented by Intel for SnB)

# no idea why this one-source BMI2 instruction is unlaminated
# It's different from ADD in that its destination is write-only (and it uses a VEX encoding)
blsi   edi, [rdi]       HSW/SKL: 1 fused-domain, 2 unfused.
blsi   edi, [rdi+rsi]   HSW/SKL: 2 fused & unfused-domain.


adc         eax, [rdi] same as cmov r, [rdi]
cmove       ebx, [rdi]   Stays micro-fused.  (SnB?)/HSW: 2 fused-domain, 3 unfused domain.  
                         SKL: 1 fused-domain, 2 unfused.

# I haven't confirmed that this micro-fuses in the decoders, but I'm assuming it does since a one-register addressing mode does.

adc   eax, [rdi+rsi] same as cmov r, [rdi+rsi]
cmove ebx, [rdi+rax]  SnB: untested, probably 3 fused&unfused-domain.
                      HSW: un-laminated to 3 fused&unfused-domain.  
                      SKL: un-laminated to 2 fused&unfused-domain.

I assume that Broadwell behaves like Skylake for adc/cmov.

It's strange that HSW un-laminates memory-source ADC and CMOV. Maybe Intel didn't get around to changing that from SnB before they hit the deadline for shipping Haswell.

Agner's insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. The cycle counts I'm measuring match up with the fused-domain uop issue count, for a 4 uops / clock issue bottleneck. Hopefully he'll double-check that and correct the tables.

Memory-dest integer ALU:

add        [rdi], eax  SnB: untested (Agner says 2 fused-domain, 4 unfused-domain: load + ALU + store-address + store-data)
                       HSW/SKL: 2 fused-domain, 4 unfused.
add    [rdi+rsi], eax  SnB: untested, probably 4 fused & unfused-domain
                       HSW/SKL: 3 fused-domain, 4 unfused.  (I don't know which uop stays fused).
                  HSW: About 0.95 cycles extra store-forwarding latency vs. [rdi] for the same address used repeatedly.  (6.98c per iter, up from 6.04c for [rdi])
                  SKL: 0.02c extra latency (5.45c per iter, up from 5.43c for [rdi]), again in a tiny loop with dec ecx/jnz


adc     [rdi], eax      SnB: untested
                        HSW: 4 fused-domain, 6 unfused-domain.  (same-address throughput 7.23c with dec, 7.19c with sub ecx,1)
                        SKL: 4 fused-domain, 6 unfused-domain.  (same-address throughput ~5.25c with dec, 5.28c with sub)
adc     [rdi+rsi], eax  SnB: untested
                        HSW: 5 fused-domain, 6 unfused-domain.  (same-address throughput = 7.03c)
                        SKL: 5 fused-domain, 6 unfused-domain.  (same-address throughput = ~5.4c with sub ecx,1 for the loop branch, or 5.23c with dec ecx for the loop branch.)

Yes, that's right, adc [rdi],eax / dec ecx / jnz runs faster than the same loop with add instead of adc on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated rewrites of the same address (store-forwarding latency higher than expected). See also this post about repeated store/reload to the same address being slower than expected on SKL.

Memory-destination adc is so many uops because Intel P6-family (and apparently SnB-family) can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an extra uop to work around the problem-case where the load and add complete, and then the store faults, but the insn can't just be restarted because CF has already been updated. Interesting series of comments from Andy Glew (@krazyglew).

Presumably fusion in the decoders and un-lamination later saves us from needing microcode ROM to produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg.


Why SnB-family un-laminates:

Sandybridge simplified the internal uop format to save power and transistors (along with making the major change to using a physical register file, instead of keeping input / output data in the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add and adc are taking full advantage of that, or if Intel had to get Haswell out the door with some instructions still un-laminating as they did on SnB.

Nehalem and earlier have a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track micro-fused uops with 3 input registers (the non-memory register operand, base, and index).


So indexed stores and ALU+load instructions can still decode efficiently (not having to be the first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages of micro-fusion are essentially gone for tuning tight loops. "Un-lamination" happens before the 4-fused-domain-uops-per-cycle issue/retire width of the out-of-order core. The fused-domain performance counters (uops_issued / uops_retired.retire_slots) count fused-domain uops after un-lamination.

Intel's description of the renamer (Section 2.3.3.1: Renamer) implies that it's the issue/rename stage which actually does the un-lamination, so uops destined for un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue / loop-buffer (aka the IDQ).

TODO: test this. Make a loop that should just barely fit in the loop buffer. Change something so one of the uops will be un-laminated before issuing, and see if it still runs from the loop buffer (LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf counters to track where uops come from, so this should be easy.
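
The counters in question exist as symbolic perf events on SnB-family, so the measurement would look something like this (a sketch of the experiment, assuming a test binary built for it):

# lsd.uops, idq.dsb_uops, and idq.mite_uops attribute issued uops to the
# loop buffer, the uop cache, and the legacy decoders, respectively.
perf stat -e lsd.uops,idq.dsb_uops,idq.mite_uops,uops_issued.any ./uop-test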

Harder TODO: if un-lamination happens between reading from the uop cache and adding to the IDQ, test whether it can ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage, can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing the first 4.)


(See a previous version of this answer for some guesses based on tuning some LUT code, with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop.)

Experimental testing on SnB

The HSW/SKL numbers were measured on an i5-4210U and an i7-6700k. Both had HT enabled (but the system idle so the thread had the whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL and Linux 4.8 on HSW, using ocperf.py. (The HSW laptop NFS-mounted my SKL desktop's /home.)

The SnB numbers were measured as described below, on an i5-2500k which is no longer working.

Confirmed by testing with performance counters for uops and cycles.

I found a table of PMU events for Intel Sandybridge, for use with Linux's perf command. (Standard perf unfortunately doesn't have symbolic names for most hardware-specific PMU events, like uops.) I made use of it for a recent answer.

ocperf.py provides symbolic names for these uarch-specific PMU events, so you don't have to look up tables. Also, the same symbolic name works across multiple uarches. I wasn't aware of it when I first wrote this answer.
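
For example, the events used below could be gathered with something like this (the exact command line is my reconstruction; the original SnB runs used raw event codes as shown further down):

ocperf.py stat -e task-clock,cycles,instructions,uops_dispatched.thread,uops_issued.any,uops_retired.retire_slots,uops_retired.all ./uop-test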

To test for uop micro-fusion, I constructed a test program that is bottlenecked on the 4-uops-per-cycle fused-domain limit of Intel CPUs. To avoid any execution-port contention, many of these uops are nops, which still sit in the uop cache and go through the pipeline the same as any other uop, except they don't get dispatched to an execution port. (An xor same,same, or an eliminated mov, would behave the same.)

Test program: yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test

GLOBAL _start
_start:
    xor eax, eax
    xor ebx, ebx
    xor edx, edx
    xor edi, edi
    lea rsi, [rel mydata]   ; load pointer
    mov ecx, 10000000
    cmp dword [rsp], 2      ; argc >= 2
    jge .loop_2reg

ALIGN 32
.loop_1reg:
    or eax, [rsi + 0]
    or ebx, [rsi + 4]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_1reg
;   xchg r8, r9     ; no effect on flags; decided to use NOPs instead

    jmp .out

ALIGN 32
.loop_2reg:
    or eax, [rsi + 0 + rdi]
    or ebx, [rsi + 4 + rdi]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_2reg

.out:
    xor edi, edi
    mov eax, 231    ;  exit(0)
    syscall

SECTION .rodata
mydata:
db 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff

I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer. See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for more investigation on HSW/SKL. SnB may be worse than HSW in this case, but I'm not sure, and no longer have working SnB hardware.

I wanted to keep macro-fusion (compare-and-branch) out of the picture, so I used nops between the dec and the branch. I used 4 nops, so with micro-fusion the loop would be 8 fused-domain uops, filling the pipeline at 2 cycles per iteration.

In the other version of the loop, using 2-register addressing modes that don't micro-fuse, the loop will be 10 fused-domain uops and run in 3 cycles.

Results from my 3.3GHz Intel Sandybridge (i5 2500k). I didn't do anything to get the cpufreq governor to ramp up clock speed before testing, because cycles are cycles when you aren't interacting with memory. I've added annotations for the performance counter events that I had to enter in hex.

testing the 1-reg addressing mode: no cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test

Performance counter stats for './uop-test':

     11.489620      task-clock (msec)         #    0.961 CPUs utilized
    20,288,530      cycles                    #    1.766 GHz
    80,082,993      instructions              #    3.95  insns per cycle
                                              #    0.00  stalled cycles per insn
    60,190,182      r1b1  ; UOPS_DISPATCHED: (unfused-domain.  1->umask 02 -> uops sent to execution ports from this thread)
    80,203,853      r10e  ; UOPS_ISSUED: fused-domain
    80,118,315      r2c2  ; UOPS_RETIRED: retirement slots used (fused-domain)
   100,136,097      r1c2  ; UOPS_RETIRED: ALL (unfused-domain)
       220,440      stalled-cycles-frontend   #    1.09% frontend cycles idle
       193,887      stalled-cycles-backend    #    0.96% backend  cycles idle

   0.011949917 seconds time elapsed

testing the 2-reg addressing mode: with a cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test x

 Performance counter stats for './uop-test x':

         18.756134      task-clock (msec)         #    0.981 CPUs utilized
        30,377,306      cycles                    #    1.620 GHz
        80,105,553      instructions              #    2.64  insns per cycle
                                                  #    0.01  stalled cycles per insn
        60,218,693      r1b1  ; UOPS_DISPATCHED: (unfused-domain.  1->umask 02 -> uops sent to execution ports from this thread)
       100,224,654      r10e  ; UOPS_ISSUED: fused-domain
       100,148,591      r2c2  ; UOPS_RETIRED: retirement slots used (fused-domain)
       100,172,151      r1c2  ; UOPS_RETIRED: ALL (unfused-domain)
           307,712      stalled-cycles-frontend   #    1.01% frontend cycles idle
         1,100,168      stalled-cycles-backend    #    3.62% backend  cycles idle

       0.019114911 seconds time elapsed

So, both versions ran 80M instructions, and dispatched 60M uops to execution ports. (An or with a memory source dispatches an ALU uop for the or and a load-port uop for the load, regardless of whether it was micro-fused or not in the rest of the pipeline; nop doesn't dispatch to an execution port at all.) Similarly, both versions retire 100M unfused-domain uops, because the 40M nops count here.

The difference is in the counters for the fused-domain.

  1. The 1-register address version only issues and retires 80M fused-domain uops. This is the same as the number of instructions. Each insn turns into one fused-domain uop.
  2. The 2-register address version issues 100M fused-domain uops. This is the same as the number of unfused-domain uops, indicating that no micro-fusion happened.

I suspect that you'd only see a difference between UOPS_ISSUED and UOPS_RETIRED(retirement slots used) if branch mispredicts led to uops being cancelled after issue, but before retirement.

And finally, the performance impact is real. The non-fused version took 1.5x as many clock cycles. This exaggerates the performance difference compared to most real cases. The loop has to run in a whole number of cycles (on Sandybridge where the LSD is less sophisticated), and the 2 extra uops push it from 2 to 3. Often, an extra 2 fused-domain uops will make less difference. And potentially no difference, if the code is bottlenecked by something other than 4 fused-domain uops per cycle.

Still, code that makes a lot of memory references in a loop might be faster if implemented with a moderate amount of unrolling and incrementing multiple pointers used with simple [base + immediate offset] addressing, instead of [base + index] addressing modes. (See the sketch below.)
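
For instance (a sketch; register choices and the vaddps bodies are mine, with uop counts implied by the rules above):

; indexed version: each 3-operand VEX load+ALU un-laminates
.indexed:
    vaddps ymm0, ymm0, [rsi+rax]
    vaddps ymm1, ymm1, [rdx+rax]
    add    rax, 32
    dec    ecx
    jg     .indexed              ; 6 fused-domain uops per iteration

; pointer-increment version: [base] stays micro-fused even for VEX
.pointers:
    vaddps ymm0, ymm0, [rsi]
    vaddps ymm1, ymm1, [rdx]
    add    rsi, 32
    add    rdx, 32
    dec    ecx
    jg     .pointers             ; 5 fused-domain uops per iteration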

Further stuff


RIP-relative with an immediate can't micro-fuse. Agner Fog's testing shows that this is the case even in the decoders / uop-cache, so they never fuse in the first place (rather than being un-laminated).

IACA gets this wrong, and claims that both of these micro-fuse:

cmp dword  [abs mydata], 0x1b   ; fused counters != unfused counters (micro-fusion happened, and wasn't un-laminated).  Uses 2 entries in the uop-cache, according to Agner Fog's testing
cmp dword  [rel mydata], 0x1b   ; fused counters ~= unfused counters (micro-fusion didn't happen)

(There are some more limits for micro+macro fusion to both happen for a cmp/jcc. TODO: write that up for testing a memory location.)

RIP-rel does micro-fuse (and stay fused) when there's no immediate, e.g.:

or  eax, dword  [rel mydata]    ; fused counters != unfused counters, i.e. micro-fusion happens

Micro-fusion doesn't increase the latency of an instruction: the load can dispatch and execute before the other input is ready.

ALIGN 32
.dep_fuse:
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    dec ecx
    jg .dep_fuse

This loop runs at 5 cycles per iteration, because of the eax dep chain. No faster than a sequence of or eax, [rsi + 0 + rdi], or mov ebx, [rsi + 0 + rdi] / or eax, ebx. (The unfused and the mov versions both run the same number of uops.) Scheduling / dep checking happens in the unfused-domain. Newly issued uops go into the scheduler (aka Reservation Station (RS)) as well as the ROB. They leave the scheduler after dispatching (aka being sent to an execution unit), but stay in the ROB until retirement. So the out-of-order window for hiding load latency is at least the scheduler size (54 unfused-domain uops in Sandybridge, 60 in Haswell, 97 in Skylake).

Micro-fusion doesn't have a shortcut for the base and offset being the same register. A loop with or eax, [mydata + rdi + 4*rdi] (where rdi is zeroed) runs as many uops and cycles as the loop with or eax, [rsi+rdi]. This addressing mode could be used for iterating over an array of odd-sized structs starting at a fixed address, but it's rarely if ever used, so it's no surprise that Intel didn't spend transistors on letting this special case of 2-register modes micro-fuse. (And Intel documents it as "indexed addressing modes" anyway, where a register and scale factor are needed.)


Macro-fusion of a cmp/jcc or dec/jcc creates a uop that stays as a single uop even in the unfused-domain. dec / nop / jge can still run in a single cycle but is three uops instead of one.
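
Illustrated (behavior exactly as described above):

cmp ecx, edx
jne .target        ; macro-fused: 1 uop, even in the unfused domain

dec ecx
nop                ; the nop defeats macro-fusion
jge .target        ; dec/nop/jge: 3 uops, but can still run in one cycle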

Opt answered 24/6, 2015 at 13:17 Comment(39)
Too bad consumer Skylake processors won't have AVX512. AVX-512 is a lot less interesting now.Moonlit
yeah, my sentiments exactly. I'm hoping Skylake Xeons will come out around the same time as desktop. A Haswell "workstation" with a xeon CPU doesn't cost much more than quality desktop, and you can use ECC RAM without limiting yourself to an i3.Opt
I just noticed that you drastically changed the text of your answer since Agner's latest answer. I am not a big fan of such drastic changes. I normally prefer updates. Did you make a new discovery? It seems your answer disagrees with Agner's. I don't like this. I was tempted to remove the accepted answer and leave it unaccepted since I respect both your answers and I don't know enough to say which is correct. Does your answer and Agner's disagree?Moonlit
@Zboson: Yes, I updated after finding official confirmation in Intel's optimization manual that resolved the discrepancy between my testing and Agner's testing. His testing method apparently measures uops in uop-cache, where indexed addressing modes are micro-fused. My testing measures fused-domain uops in the issue stage, after they've been "un-laminated". Indexed addressing modes micro-fuse in the decoders and uop-cache. So we're both technically right. I should send him a mail; I guess he didn't see my comment. His guide should def. mention this.Opt
Yes, send him an email or maybe write on his blog agner.org/optimize/blogMoonlit
It's not clear to me from your answer what IACA is right and wrong about. Can you explain it in one or two sentences? BTW, your answer on deoptimizing was in the top 30 on news.ycombinator.com yesterday. Here is the discussion news.ycombinator.com/item?id=11749756Moonlit
Oh, I see you already commented to that thread. You're active on news.ycombinator.com as well?Moonlit
@Zboson: That IACA section was badly worded, thanks for pointing that out. Fixed. re: ycombinator: No, don't even follow ycombinator at all. I think you or someone else pointed me to that thread yesterday, so I registered to leave a couple replies.Opt
So will these indexed mode instructions still un-laminate on SKL? There were some changes there (building on related changes in BDW) that allow the RS to handle ops with 3-input dependencies. For eg, CMOV now generates only one uop, whereas before it was 2. Similarly for a few other 3-input instructions. So perhaps the un-laminating has been eliminated now.Gimble
@BeeOnRope: IIRC, Intel's optimization manual has a lot of specific stuff to say about SKL in the same section as that un-laminating on SnB-family, and that isn't one of them, IIRC. I think it's not just the extra register dependency, but also bits that say which addressing mode it is, and the scale-factor bits. Even [disp32 + index] one-register addressing modes are un-laminated, so it's not just a matter of tracking an extra dependency.Opt
@BeeOnRope: Also, with micro-fusion, they're always separate in the unfused-domain RS (scheduler) where uops wait for their inputs and port to be ready. They're only fused in the ROB where they wait to retire. Converting adc and cmov to be one uop affected even the unfused-domain. Adding extra bits to uop format in the ROB would probably have required a lot of redesign time. (And redesign in other places, too, like no longer un-laminating in the issue stage). Still, it's something we can hope Intel does eventually.Opt
@PeterCordes - right, the changes to allow cmov and friends to be single-uop (across all domains) were larger than just the 3-arg stuff, but it seems like one part of that change may have been to increase the size of the entry in the ROB to accommodate 3-arg uops: since cmov and friends allow all the complex instruction modes, perhaps they wouldn't fit in the ROB otherwise. That could have the side effect of also allowing stuff like complex modes to avoid unlamination. Anyway, I'm about to test it.Gimble
BTW - this note: "I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops." - is perhaps the most interesting finding above, even though it isn't directly related to the above! Have you seen any confirmation elsewhere? If true, it should be a prominent item in optimization guides, reflected in IACA, etc. You seem to indicate that you think it is a limitation of the loop buffer only, but it would be very odd if the loop buffer had worse throughput than the uop buffer or legacy decode modes. Worth an investigation...Gimble
@BeeOnRope: A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger loops (that don't fit in the loop buffer) to see if it's possible for the first instruction from the next iteration to issue in the same group as the taken-branch to it, but I assume not. I think the point is that the loop buffer really can provide a guaranteed 4u/c, while the other uop sources don't even without branches. (uop cache-line boundaries can limit the front-end, since at least SnB is only capable of reading from 1 ucache line per clock, and only reads 4 of the up-to-6 uops) SKL is different, IIRCOpt
The partial-group-at-the-end effect is a bigger factor for tiny loops, because 5 uops per 2 cycles is way worse than 97 uops per 25 cycles (assuming perfect uop-cache throughput). This may be why it's worth specifically mentioning.Opt
Yeah, it definitely adds another factor to consider in decisions like "unroll vs not unroll" (in favor of unroll) and also in instruction selection (e.g., you might prefer an 8-uop loop over a 7-uop loop because it is better in some other criteria such as instruction size, HT friendliness, cache friendliness, power use, AMD performance, whatever).Gimble
It seems like Skylake has fixed the unlamination issue. My test of your code shows the exact same results for both the 1-arg and 2-arg versions. I'll put my results as another answer, but feel free to edit it into your answer too.Gimble
@PeterCordes - your finding about loop bodies is interesting enough to warrant a separate question, which I posted here. I'm investigating the answer for Skylake, which is the only hardware I have available. Well I think I have some SB too (~2013 MacBook Air?) but it's running OSX.Gimble
BTW, I posted over on Agner's blog about this issue, linking to this question, with the idea that this would be great to cover in his manual. I think based on the investigation here the issue is pretty much fully understood - but having it in what is kind of the canonical (or at least best) source would be ideal.Gimble
@PeterCordes - I posted the results of my investigation on the multiple-of-4 issue, covering not only the LSD but the legacy decoder and uop cache too. The summary on Skylake is that indeed the LSD has various restrictions, but it is far from as simple as "must be a multiple of 4". For example, a 7 uop list required 2 cycles, as you'd expect from the simple 4N interpretation, but a 9 uop loop required 2.3 cycles (not the 3 you'd expect if it was rounded up to 12 uops). More mysteries abound in the LSD. The DSB and legacy decode were simpler.Gimble
Working on an update to this: HSW/SKL can only keep a uop micro-fused if it has 2 operands and treats the dst register as read-modify-write. e.g. paddb xmm0, [rdi+rbx] but not vpaddb xmm0, xmm0, [rdi+rbx] or pabsb xmm0, [rdi+rdx].Opt
@Peter - Huh so that substantially rules out VEX encoded stuff, which is generally three argument? Kind of unfortunate because before knowing that VEX seemed like a clear win.Gimble
@BeeOnRope: Yeah, it's an unfortunate downside to VEX :(. I forgot to clarify that it's still only a problem for indexed addressing modes, and only for ALU+load. VEX stores stay micro-fused. You can usually just unroll to amortize the cost of incrementing more pointers and use vpaddb xmm0, xmm0, [rdi+32], which stays micro-fused even on SnB/IvB. But even with indexed addressing modes, VEX non-destructive 3-operand is still usually a win for front-end throughput because of avoiding so many MOVDQA reg,reg uops.Opt
@PeterCordes - I'm not sure if you're still working on a re-write, but I added the new info you discovered about HSW/SKL and 2-operand RMW to "the rules" because it's too important to be lost down here in a comment, I think. It really expands the cases where fusion doesn't happen: not fusing for VEX-encoded memory-source ops is pretty important to keep in mind.Gimble
@BeeOnRope: thanks, I'd been meaning to make a minor edit with that before I get back to the big edit. The current state of the text I'm working on includes a complete explanation of what micro-fusion is, because Agner's description is incomplete and doesn't describe where un-lamination happens (or mention it at all). It was getting so big that it's a bit daunting to get back to. It's hard to limit it to just describing the behaviour without spending too much time on optimization advice like when you'd want to use indexed addressing modes or not.Opt
The condition about "... and treats the dst register as read-modify-write" is a bit subtle. For example popcnt and bsf can both micro-fuse with indexed addressing modes, even though the former has a write-only destination, and latter is mostly write-only. I guess the "false dependency" saves them from unlamination? On the other hand, tzcnt doesn't micro-fuse (and it has no false dependency) so that's one way bsf is better than tzcnt!Gimble
BTW, I tested most of the cases you mention on CannonLake and everything is the same as Skylake, except that now popcnt does unlaminate, which lines up with the fact that the false dep for popcnt is fixed in CNL.Gimble
@PeterCordes The raw event that you used r1c2 is documented as Counts the number of micro-ops retired, (macro-fused=1, micro-fused=2, others=1; maximum count of 8 per cycle). only for CPUs with DisplayModel_DisplayFamily: 06_1AH, 06_1EH, 06_1FH, and 06_2EH. On my WhL it is 06_8e. Can the event be reliably used on SkL/KbL/WhL uarchs? In the event table list of my architecture (also SkL and KbL) it appears as UOPS_RETIRED.TOTAL_CYCLES with the signature cpu/event=0xc2,umask=0x1,cmask=0x10,inv/ and there is no UOPS_RETIRED.ANY.Including
@PeterCordes I think it is inappropriate to use r1c2, r2c2 at least on WhL with DisplayFamily_DisplayModel 06_8e as it is not documented in the related event table and Intel Manual Vol.3/19 clearly states: All performance event encodings not documented in the appropriate tables for the given processor are considered reserved, and their use will result in undefined counter updates with associated overflow actions.Including
@St.Antario: IDK, the r1/2c2 data in this answer was collected on SnB (i5-2500k). I generally just use named events provided by perf, or previously ocperf.py, without getting into the details of event/umask programming. On my SKL, the event you want is uops_retired.retire_slots, like I used in the SKL section here. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? where I used uops_issued_any:u and uops_executed.thread:u. (Issue and retire fused-domain counts are the same if there are no mispredicts or other rollbacks.)Opt
@PeterCordes There are no issues with the uops_issued.any/uops_executed.thread counters. Both are present in the perf list. UOPS_RETIRED.RETIRE_SLOTS was also present and corresponds to the r2c2 raw counter on SkL/KbL/CfL. It was documented as Counts the number of retirement slots used each cycle. If the retirement slots used are accounted with respect to micro-fusion then this is the counter I'm looking for. But the documentation is not that detailed anyway...Including
@St.Antario: Yeah, I had to test it to see that it counted 1 for an instruction that stayed micro-fused, and 2 for instructions that un-laminated or couldn't micro-fuse in the first place. But we do know that retirement happens in the fused domain so the "retire slots" name makes sense.Opt
I was wondering where unlamination really occurs and whether instructions that need to be unlaminated can bypass the IDQ i.e. it goes to the allocator whilst it is cached in the queue. In which case it would be the allocator that unlaminates. Which would increase the LSD/IDQ entries available (but not ROB). Alternatively, it appears in the IDQ and then allocator unlaminates instantly (its never not in an unlaminated state in the queue). If unlamination were done while in the IDQ (I.e. enters laminated and then unlaminates) it would mean a 1 cycle delay and wouldn't be able to bypass the IDQBechance
@LewisKelsey: I think I tried to test LSD capacity on HSW at some point but I don't remember what I found; hopefully I wrote down the results somewhere. But IIRC, there's some evidence that un-lamination can take an extra cycle in the front-end sometimes for the extra uop, like a loop that needs un-lamination can be slower than a loop which isn't fused in the first place. Perhaps on one of Andreas Abel's questions? That would indicate that un-lamination doesn't happen until after the IDQ. But it's been a while since I looked, and I forget how well we ruled out alternative explanations.Opt
Or maybe I'm just remembering those ideas that I already mentioned as TODO items in this answer but haven't actually done. I don't have a machine with an LSD I can test on, unless I somehow boot my SKL with old microcode.Opt
@PeterCordes re: "with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop": any chance you have a link with more information on this? (i.e., I didn't see any references in the earlier version).Gal
@Noah: Version 8 looks like the last version with those 2 paragraphs: stackoverflow.com/revisions/31027695/8 with a link to github.com/pcordes/par2-asm-experiments.git. (That's probably better done with pshufb for register LUTs, because IIRC the problem is separable into 4-bit chunks. That was one of the first asm-optimization things I played with after getting into Agner Fog's manuals; before that I knew about asm in general, and some about CPU architecture, but hadn't seriously put it to use for optimization. So I didn't realize how good pshufb was.)Opt
@PeterCordes and yet you were already providing fantastic answers like this one! But nice to see you are human too xDGal
@PeterCordes if you have the words, renaming bubbles from unlamination might be a useful link in the further reading / performance impact.Gal

Note: Since I wrote this answer, Peter tested Haswell and Skylake as well and integrated the results into the accepted answer above (in particular, most of the improvements I attribute to Skylake below seem to have actually appeared in Haswell). You should see that answer for the rundown of behavior across CPUs and this answer (although not wrong) is mostly of historical interest.

My testing indicates that on Skylake at least1, the processor fully fuses even complex addressing modes, unlike Sandybridge.

That is, the 1-arg and 2-arg versions of the code posted above by Peter run in the same number of cycles, with the same number of uops dispatched and retired.

My results:

Performance counter stats for ./uop-test:

     23.718772      task-clock (msec)         #    0.973 CPUs utilized          
    20,642,233      cycles                    #    0.870 GHz                    
    80,111,957      instructions              #    3.88  insns per cycle        
    60,253,831      uops_executed_thread      # 2540.344 M/sec                  
    80,295,685      uops_issued_any           # 3385.322 M/sec                  
    80,176,940      uops_retired_retire_slots # 3380.316 M/sec                  

   0.024376698 seconds time elapsed

Performance counter stats for ./uop-test x:

     13.532440      task-clock (msec)         #    0.967 CPUs utilized          
    21,592,044      cycles                    #    1.596 GHz                    
    80,073,676      instructions              #    3.71  insns per cycle        
    60,144,749      uops_executed_thread      # 4444.487 M/sec                  
    80,162,360      uops_issued_any           # 5923.718 M/sec                  
    80,104,978      uops_retired_retire_slots # 5919.478 M/sec                  

   0.013997088 seconds time elapsed

Performance counter stats for ./uop-test x x:

     16.672198      task-clock (msec)         #    0.981 CPUs utilized          
    27,056,453      cycles                    #    1.623 GHz                    
    80,083,140      instructions              #    2.96  insns per cycle        
    60,164,049      uops_executed_thread      # 3608.645 M/sec                  
   100,187,390      uops_issued_any           # 6009.249 M/sec                  
   100,118,409      uops_retired_retire_slots # 6005.112 M/sec                  

   0.016997874 seconds time elapsed

I didn't find any UOPS_RETIRED_ANY event on Skylake, only the "retired slots" one, which is apparently fused-domain.

The final test (uop-test x x) is a variant Peter suggested, which uses a RIP-relative cmp with an immediate, which is known not to micro-fuse:

.loop_riprel:
    cmp dword [rel mydata], 1
    cmp dword [rel mydata], 2
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_riprel

The results show that the extra 2 uops per loop iteration are picked up by the uops-issued and uops-retired counters (hence the test can differentiate between fusion occurring and not).

More tests on other architectures are welcome! You can find the code (copied from Peter above) on GitHub.


[1] ... and perhaps some other architectures in-between Skylake and Sandybridge, since Peter only tested SB and I only tested SKL.

Gimble answered 2/9, 2016 at 5:31 Comment(10)
Did you test any cases that are known not to micro-fuse in the first place? e.g. RIP-relative with immediate? (either read-modify-write ALU, mov store, or cmp/test mem, imm8). It would be very good to confirm that your perf-counter results do show the difference between micro-fusion and no micro-fusion.Opt
I'm not following this part: (either read-modify-write ALU, mov store, or cmp/test mem, imm8). Is RIP-relative + immediate enough to avoid micro-fusion or do I need to use it in one of those instructions?Gimble
You need a RIP-relative and an immediate in the same insn. There are three different cases: store-only (mov dword [rel symbol], 1234), load-only (cmp dword [rel symbol], 1), and read-modify-write (or dword [rel symbol], 1). There are also some instructions that apparently never micro-fuse, according to Agner's tables. e.g. shlx r,m,i is 2 uops in fused and unfused domains, but only 1 uop with a register src. Similarly, pblendw is like this. pinsrb/w/d/q is either 2p5 (reg src) or p5+p23 (mem src).Opt
BTW, Agner already documents cases that never micro-fuse in the first place, since his testing method detected them. I think he was testing via uop cache consumption, which is why he didn't detect unlamination.Opt
OK, I tested the cmp [sym], 1 variant and indeed it shows 2 more uops issued and retired per loop (i.e., the last two counters above), and an increase in cycles. Other counters unchanged.Gimble
Updated question with results and github repo.Gimble
@PeterCordes - missed your first question. Yeah I got the names from ocperf.py. The missing events seem consistent with other lists I've seen. I didn't follow the NOP comment - the existing "retired" counters all already seem to count the NOPs? Only the "uops issued" counter seems to ignore them (indeed, since they are eliminated prior to issue). Do NOPs take an entry in the ROB?Gimble
You're right, that was nonsense. It's been a while since I looked at my test code and numbers in detail. I guess NOPs take ROB entries. You have to be able to jmp to them, so they definitely need uop-cache entries. There doesn't seem to be any need for an interrupt to be able to happen between two NOPs, but x86 has lots of corner cases. (e.g. mov ss, reg disables interrupts until after the next instruction.) Since running NOPs isn't usually a performance bottleneck, presumably Intel just let them go through the pipe instead of totally hiding them.Opt
NOPs do issue and retire, but they don't dispatch to any execution unit. That's what we actually see in the perf results. They're not really "eliminated"; only in the sense that the work of a MOV or XOR-zeroing is eliminated, not the whole uop. So I guess they enter the ROB in an already-executed state, and never enter the RS (scheduler, aka Reservation Station)Opt
Updated my answer with test results from a Haswell laptop and my SKL desktop. HSW can micro-fuse indexed addressing modes the same way SKL can. IACA is wrong.Opt

Older Intel processors without a uop cache can do the fusion, so maybe this is a drawback of the uop cache. I don't have the time to test this right now, but I will add a test for uop fusion next time I update my test scripts. Have you tried with FMA instructions? They are the only instructions that allow 3 input dependencies in an unfused uop.

Cossack answered 12/7, 2015 at 5:49 Comment(2)
I haven't. I don't have a Haswell CPU. >.< But that's an excellent point, fusion rules might be different.Opt
@PeterCordes, I originally discovered this from a question using FMA. See the part where I discuss Stephen Canon's comment. He suggested "using the store address as the offset for the load operands," which allows the store to use port 7. However, this does not fuse so it's no better. The only solution which allowed me to have four fused micro-ops (6 total) was Evgeny Kluev's suggestion using a static array and one-register mode. I asked this question because of that question.Moonlit

I have now reviewed test results for Intel Sandy Bridge, Ivy Bridge, Haswell and Broadwell. I have not had access to test on a Skylake yet. The results are:

  • Instructions with two-register addressing and three input dependencies fuse all right. They take only one entry in the micro-operation cache as long as they contain no more than 32 bits of data (or 2 * 16 bits).
  • It is possible to make instructions with four input dependencies, using fused multiply-and-add instructions on Haswell and Broadwell. These instructions still fuse into a single micro-op and take only one entry in the micro-op cache (illustrated after this list).
  • Instructions with more than 32 bits of data, for example a 32-bit address and 8-bit immediate data, can still fuse, but use two entries in the micro-operation cache (unless the 32 bits can be compressed into a 16-bit signed integer)
  • Instructions with rip-relative addressing and an immediate constant do not fuse, even if both the offset and the immediate constant are very small.
  • All the results are identical on the four machines tested.
  • The tests were performed with my own test programs using the performance monitoring counters on loops that were sufficiently small to fit into the micro-op cache.
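
An instruction with four input dependencies, as in the second point above, would look like this (my illustration, not from Agner's test code; per the accepted answer, it stays fused in the decoders/uop cache on HSW but is un-laminated at issue):

vfmadd231ps ymm0, ymm1, [rsi+rax*4]   ; 4 inputs: ymm0 (the 231 form reads its destination), ymm1, rsi, rax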

Your results may be due to other factors. I have not tried to use the IACA.

Cossack answered 1/12, 2015 at 14:54 Comment(3)
I was using small ~8 uop loops on SnB, and looking at the perf counters for fused and unfused domain uops. Can you see anything wrong with my test code (posted in my answer)? I was using instructions like or eax, [rsi + 4 + rdi], which only has 32bits of data (the offset). Was I looking at the wrong perf counter or something? The change in observed behaviour (cycles to run the loop) matches up with fusion not happening -> loop takes more cycles per iteration because of the 4-wide pipe. And fused-domain matches unfused-domain counts.Opt
I was testing fused-domain uops against the 4-wide limit of the pipeline for issuing / retiring 4 fused-domain uops per clock. Is it possible that the uop cache can fuse better than the rest of the pipeline? My test was with tiny loops, which fit in the loop buffer, so the uop cache shouldn't have been directly involved.Opt
Intel's optimization manual confirms that micro-fusion happens in the decoders, but indexed addressing modes are "un-laminated" as they issue. Others stay fused. So micro-fusion doesn't help when the 4-wide issue/retire throughput is the bottleneck, nor does it help with fitting more insns into the ROB. See my updated answer.Opt
