Can two fuseable pairs be decoded in the same clock cycle?
I'm trying to verify the conclusion that two fuseable pairs can be decoded in the same clock cycle, using my Intel i7-10700 on Ubuntu 20.04.

The test code is arranged as shown below, and it is repeated about 8000 times so that the code runs mostly from the legacy decode path (MITE), avoiding the influence of the LSD and DSB.

ALIGN 32
.loop_1:
    dec ecx             ; dec/jge is a macro-fusion candidate
    jge .loop_2         ; target is just the next instruction
.loop_2:
    dec ecx
    jge .loop_3
.loop_3:
    dec ecx
    jge .loop_4
.loop_4:
.loop_5:
    dec ecx
    jge .loop_6
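
(For reference, a minimal sketch of how such a test could be generated with NASM's %rep instead of literal copy-pasting. This is my own illustration, not the asker's actual harness, which was only linked in a comment; the %rep generation and the restart jump are my assumptions. The program never exits, matching the perf stats below that attach to a running process by PID.)

global _start
_start:
.restart:
   mov ecx, 100000000
ALIGN 32
%assign i 1
%rep 8000              ; stands in for the ~8000 pasted copies
   .loop%+i:
   dec ecx             ; macro-fusion candidate with the following jge
   %assign i i+1
   jge .loop %+ i      ; target is the next instruction, so execution
                       ; proceeds linearly whether taken or not
%endrep
.loop%+i:
   jmp .restart        ; run forever; measure with perf stat -p <pid>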

The test result indicates that only one pair is fused per cycle, on average: r479 divided by r1002479 is roughly 1.

 Performance counter stats for process id '22597':

   120,459,876,711      cycles
    35,514,146,968      instructions              #    0.29  insn per cycle
    17,792,584,278      r479                      # r479: uops delivered to the Instruction
                                                  # Decode Queue (IDQ) from the MITE path
        50,968,497      r4002479
    17,756,894,879      r1002479                  # r1002479: cycles MITE is delivering any uop

      26.444208448 seconds time elapsed

I don't think Agner's conclusion is wrong. So is there something wrong with my perf usage, or is there some insight about the code that I'm missing?

Excited asked 12/11, 2021 at 3:12 (1 comment):
@AlexGuteniev Full version of my code. It's a little bit ugly and very redundant. - Excited
Answer (score 6):

On Haswell and later, yes. On Ivy Bridge and earlier, no.

On Ice Lake and later, Agner Fog says macro-fusion is done right after decode, instead of in the decoders, which had required the pre-decoders to send the right chunks of x86 machine code to the decoders accordingly. (And Ice Lake has slightly different restrictions: instructions with a memory operand cannot fuse, unlike previous CPU models; instructions with an immediate operand can fuse.) So on Ice Lake, macro-fusion doesn't let the decoders handle more than 5 instructions per clock.
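
For example, a quick illustration of those Ice Lake rules (my own sketch, not code from the tests below; the .target labels are placeholders):

   cmp ecx, 1          ; immediate operand: can macro-fuse with jne on ICL
   jne .target1
   cmp [rdi], eax      ; memory operand: cannot macro-fuse on ICL,
   jne .target2        ;  though it could on Haswell / Skylake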

Wikichip claims that only 1 macro-fusion per clock is possible on Ice Lake, but that's probably incorrect. Harold tested with my microbenchmark on Rocket Lake and found the same results as Skylake. (Rocket Lake uses a Cypress Cove core, a variant of Sunny Cove back-ported to a 14nm process, so it's likely that it's the same as Ice Lake in this respect.)


Your results indicate that uops_issued.any is about half of instructions, so you are seeing macro-fusion of most pairs. (You could also look at the uops_retired.macro_fused perf event. BTW, modern perf has symbolic names for most uarch-specific events: use perf list to see them.)
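
For instance, a hypothetical invocation using those symbolic names, attaching to the same PID as in your stats (event availability depends on CPU and kernel/perf version):

$ perf list | grep -i -e macro_fused -e mite    # discover the symbolic names
$ perf stat -e cycles,instructions,uops_issued.any,uops_retired.macro_fused,idq.mite_uops,idq.all_mite_cycles_any_uops -p 22597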

The decoders will still produce up to four or even five uops per clock on Skylake-derived microarchitectures, though, even if they only make two macro-fusions. Compare your MITE-active cycles (r1002479, ~17.8G) against total cycles (~120G): execution stalls most of the time, until there's room in the ROB / RS for an issue-group of 4 uops. And that opens up space in the IDQ for a decode group from MITE.


You have three other bottlenecks in your loop:

  • Loop-carried dependency through dec ecx: only 1/clock, because each dec has to wait for the result of the previous one to be ready. (See the sketch after this list.)

  • Only one taken branch can execute per cycle (on port 6), and dec/jge is taken almost every time, except for 1 in 2^32 when ECX was 0 before the dec.
    The other branch execution unit on port 0 only handles predicted-not-taken branches. https://www.realworldtech.com/haswell-cpu/4/ shows the layout but doesn't mention that limitation; Agner Fog's microarch guide does.

  • Branch prediction: even a jump to the next instruction, which is architecturally a NOP, is not special-cased by the CPU (see Slow jmp-instruction). There's no reason for real code to do this, except for call +0 / pop, which is special-cased at least for the return-address predictor stack.

    This is why you're executing at significantly less than one instruction per clock, let alone one uop per clock.
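
To illustrate the first bullet, a sketch of my own (not from the original test): repeated dec of the same register forms a serial dependency chain, while decs of different registers are independent and can execute in parallel on separate ALU ports.

   ; serial: each dec must wait for the previous one, so ~1 per clock
   dec ecx
   dec ecx
   dec ecx

   ; independent: three separate dep chains can all execute in one cycle
   dec ecx
   dec edx
   dec esi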


Working demo of 2 fusions per clock

Surprisingly to me, MITE didn't go on to decode a separate test and jcc in the same cycle as it made two fusions. I guess the decoders are optimized for filling the uop cache. (A similar effect on Sandy Bridge / Ivy Bridge: if the final uop of a decode group is potentially fusable, like dec, the decoders will only produce 3 uops that cycle, in anticipation of maybe fusing the dec next cycle. That's true at least on SnB/IvB, where the decoders can only make 1 fusion per cycle, and they will decode separate ALU + jcc uops if there is another pair in the same decode group. Here, SKL is choosing not to decode even a separate test uop, let alone a jcc and another test, after making two fusions.)

global _start
_start:
   mov ecx, 100000000
ALIGN 32
.loop:
%rep 399          ; the loop branch makes 400 total
   test ecx, ecx
   jz  .exit_loop        ; many of these will be 6-byte jcc rel32
%endrep
   dec  ecx
   jnz  .loop

.exit_loop:
   mov eax, 231
   syscall          ; exit_group(EDI)

On i7-6700k Skylake, perf counters for user-space only:

$ nasm -felf64 fusion.asm && ld fusion.o -o fusion       # static executable
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.all_mite_cycles_any_uops,idq.mite_uops -r2 ./fusion

 Performance counter stats for './fusion' (2 runs):

          5,165.34 msec task-clock                #    1.000 CPUs utilized            ( +-  0.01% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 1      page-faults               #    0.194 /sec                   
    20,130,230,894      cycles                    #    3.897 GHz                      ( +-  0.04% )
    80,000,001,586      instructions              #    3.97  insn per cycle           ( +-  0.00% )
    40,000,677,865      uops_issued.any           #    7.744 G/sec                    ( +-  0.00% )
    40,000,602,728      uops_executed.thread      #    7.744 G/sec                    ( +-  0.00% )
    20,100,486,534      idq.all_mite_cycles_any_uops #    3.891 G/sec                    ( +-  0.00% )
    40,000,261,852      idq.mite_uops             #    7.744 G/sec                    ( +-  0.00% )

          5.165605 +- 0.000716 seconds time elapsed  ( +-  0.01% )

Not-taken branches aren't a bottleneck, perhaps because my loop is big enough to defeat the DSB (uop cache) but not so big that it defeats branch prediction. (Actually, the JCC erratum mitigation on Skylake will definitely defeat the DSB: if everything is a macro-fused branch, there will be one touching the end of every 32-byte region. Only if we start introducing NOPs or other instructions between branches will the uop cache be able to operate.)

We can see that everything was fused (80G instructions in 40G uops) and executing at 2 test-and-branch uops per clock (20G cycles). Also that MITE is delivering uops every cycle, 20G MITE cycles. And what it does deliver is apparently 2 uops per cycle, at least on average.

A test with alternating groups of NOPs and not-taken branches might be good to see what happens when there's room for the IDQ to accept more uops from MITE, to see if it will send non-fused test and JCC uops to the IDQ.


Further tests:

Backwards jcc rel8 for all the branches made no difference, same perf results:

%assign i 0 
%rep 399          ; the loop branch makes 400 total
   .dummy%+i:
   test ecx, ecx
   jz  .dummy %+ i
   %assign i i+1
%endrep

MITE throughput: alternating groups of NOPs and macro-fused branches

The NOPs still need to get decoded, but the back-end can blaze through them. This makes total MITE throughput the only bottleneck, instead of the pipeline being limited to 2 uops / clock regardless of how many more uops MITE could have produced.

global _start
_start:
   mov ecx, 100000000
ALIGN 32
.loop:
%assign i 0 
%rep 10
 %rep 8
   .dummy%+i:
   test ecx, ecx
   jz  .dummy %+ i
   %assign i i+1
 %endrep
 times 24 nop
%endrep

   dec  ecx
   jnz  .loop

.exit_loop:
   mov eax, 231
   syscall          ; exit_group(EDI)

 Performance counter stats for './fusion':

          2,594.14 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 1      page-faults               #    0.385 /sec                   
    10,112,077,793      cycles                    #    3.898 GHz                    
    40,200,000,813      instructions              #    3.98  insn per cycle         
    32,100,317,400      uops_issued.any           #   12.374 G/sec                  
     8,100,250,120      uops_executed.thread      #    3.123 G/sec                  
    10,100,772,325      idq.all_mite_cycles_any_uops #    3.894 G/sec                  
    32,100,146,351      idq.mite_uops             #   12.374 G/sec                  

       2.594423202 seconds time elapsed

       2.593606000 seconds user
       0.000000000 seconds sys

So it seems MITE couldn't keep up with 4-wide issue. The blocks of 8 branches are making the decoders produce significantly fewer than 5 uops per clock; probably only 2, as we were seeing for longer runs of test/jcc.

The 24 NOPs themselves can decode in about 6 cycles at 4 instructions per clock, so they aren't the limiting factor.

Reducing to groups of 3 test/jcc and 29 NOPs gets it down to 8.607 Gcycles, with MITE active for 8.600 Gcycles and 32.100G MITE uops. (3.099G uops_retired.macro_fused, with the 0.1 coming from the loop branch.) Still not saturating the front-end at 4.0 uops per clock, like I was hoping it might with a macro-fusion at the end of one decode group.
It is hitting 4.09 IPC, so at least the decoders and the issue bottleneck are ahead of where they'd be with no macro-fusion.
(The best case for macro-fusion is 6.0 IPC, with 2 fusions per cycle and 2 other uops from non-fusing instructions: 4 instructions in 2 fused uops, plus 2 single-uop instructions, fills a 4-wide issue group with 6 instructions. That's separate from unfused-domain back-end uop throughput limits via micro-fusion; see this test for ~7 uops_executed.thread per clock.)

Even %rep 2 test/JCC hurts throughput, which seems to indicate that it just stops decoding after making 2 fusions, not even decoding 2 or 3 more NOPs after that. (For some lower NOP counts, we get some uop-cache activity because the outer rep count isn't big enough to totally fill up the uop cache.)

You can test this in a shell loop like for NOPS in {0..20}; do nasm ... -DNOPS=$NOPS ..., with the source using times NOPS nop; see the sketch below.
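
A sketch of such a loop, assuming the source file is fusion.asm and contains times NOPS nop, reusing the build and measurement commands from earlier (event selection trimmed for brevity):

$ for NOPS in {0..20}; do
      nasm -felf64 -DNOPS=$NOPS fusion.asm && ld fusion.o -o fusion
      echo "NOPS=$NOPS"
      taskset -c 3 perf stat --all-user -e cycles,instructions,uops_issued.any,idq.mite_uops ./fusion
  done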

There are some plateau/step effects in total cycles vs. number of NOPS for %rep 2, so maybe the two test/JCC uops are decoding at the end of a group, with 1, 2, or 3 NOPs before them. (It's not super consistent, though, especially for lower numbers of NOPS; NOPS=16, 17, and 18 are all right around 5.22 Gcycles, while 14 and 15 are both at 4.62 Gcycles.)

There are a lot of possibly-relevant perf counters if we want to really get into what's going on, e.g. idq_uops_not_delivered.cycles_fe_was_ok (cycles where the issue stage got 4 uops, or where the back-end was stalled so it wasn't the front-end's fault.)

Anthropogenesis answered 12/11, 2021 at 8:12 (26 comments):
Does Haswell and later include Ice Lake (and its family)? Wikichip says only one such fusion can be performed during each cycle; in my experiments it seemed to be able to do 2 anyway. - Suds
Great explanation!!! What a pity that I don't have enough reputation to upvote. I have learned a lot from the answer. Thank you. - Excited
@harold: I don't have an Ice Lake or Tiger Lake to test on, but anyone who does can use this test code to check, if they have access to perf counters. (Fusion or not shouldn't affect overall throughput for this test, assuming the decoders are willing to decode test and JCC separately.) Agner Fog says macro-fusion works differently on ICL: "The fusion is not done by the decoders but immediately after the decode stage." I find the wikichip claim surprising; I don't think Intel would have weakened fusion that much. - Anthropogenesis
@moep0: you can still use the "accept" checkmark under the vote arrows, if this fully answers your question. - Anthropogenesis
Results were similar on Rocket Lake: MITE_CYCLES_ANY was half the number of uops, and the number of uops was half the number of instructions. - Suds
@harold: And I assume it was able to execute 2 uops/clock, proving that it didn't have to slow down to 1 fusion/clock while keeping everything fused? (I figure you would have mentioned it if that was the case, but since wikichip has that contradictory claim...) - Anthropogenesis
Clock cycles unhalted was also half the number of uops. - Suds
For your first example fusion.asm: on Tiger Lake I'm seeing it run primarily out of the DSB cache (37G dsb, 3G mite). Still 20G cycles. AFAICT this is just because the not-taken test; jz is bottlenecking on p0; I don't see what this says about decode. - Disciplinant
Wonder what this says about performance tuning for branch-heavy code. E.g., glibc's short-memcpy case has a list of 4 consecutive branches. If decode is going to bottleneck after two, however, it might be better to do something more akin to Linux's small-memcpy case (although so many jumps isn't great either). - Disciplinant
Does the %rep 2; testl; jz; times 24 nop case essentially mean that after 2x fusable instructions the decode shuts down for the remainder of the cycle? - Disciplinant
@Noah: yes, my microbenchmark only defeats the DSB on Skylake with the JCC erratum. On other CPUs, you'd probably want to ramp up the outer repeat count. And yes, the most likely explanation for the observed performance is that decode shuts down after 2x fusable instructions, not even being willing to produce any other uops even if later instructions aren't fusable. I mostly tried nop and xor-zeroing, not, for example, some load or blsi or other things mixed with NOPs. - Anthropogenesis
@Noah: We expect Tiger Lake to be different because ICL apparently does macro-fusion after decode. - Anthropogenesis
@PeterCordes I thought TGL was essentially micro-architecturally the same as ICL except for the mov-elimination errata. Bummer that it shuts down. Does this partially explain Agner's finding that jmp throughput hits a slow case with multiple jumps in the same fetch block? What if you have an extra operation in between the two macro-fusing instructions, i.e. testl; jz; add; testl; jz? Will the add come through in the same fetch block? - Disciplinant
@PeterCordes On TGL I'm getting some weird results, but there is at least some evidence that the middle uop goes through. For example, I see a difference in perf between testl; jz; shl r0, cl; testl; jz; shl r0, cl and testl; jz; testl; jz; shl r0, cl; shl r0, cl. I don't see any difference in port distribution or any other bottlenecks (only idq_uops_not_delivered_core). But if I put both shl payload uops in the loop I also get slow performance, so I'm still unsure. - Disciplinant
@PeterCordes TGL apparently does macro-fusion after decode as well. - Disciplinant
@Noah: I meant TGL would be different from my SKL results, not different from ICL, sorry. I was assuming TGL and ICL were the same, and meant "since ICL", not "just ICL". Good idea to test a fusable insn like add between test/jz ops; will try that now. - Anthropogenesis
@PeterCordes I am wondering why the number of NOPs (and the perf results) can prove that 'it just stops decoding after making 2 fusions'. - Excited
@moep0: It doesn't fully prove it; there could be some other mechanism that explains why front-end throughput was exactly that across a range of numbers of NOPs. But something is obviously reducing front-end throughput below the issue/rename bottleneck of 4 uops per clock, and that theory does explain all the data from the experiments I've thought of and tried thus far. - Anthropogenesis
Oh, yes. That seems reasonable. Therefore, I guess test ecx, ecx will use the RAT instead of going to the Reorder Buffer directly (that's why the bottleneck is 4 uops instead of 6 uops per cycle). Besides, is there any useful blog about the RAT mechanism? I have only googled some papers. - Excited
@moep0: uops that aren't eliminated, like a macro-fused test/jz, get added simultaneously to the RAT and the ROB, with the ROB entry marked as not yet complete. Uops that are eliminated in the front-end, like NOP, xor-zeroing, or mov reg,reg, don't get added to the RAT, but do still have to be added to the ROB (marked as already done, ready to retire). - Anthropogenesis
@moep0: The issue width for adding uops to the back-end (including the ROB) is only 4-wide on Skylake, so nothing will ever get more than 4 fused-domain uops through the front-end in a single cycle. Parts of the front-end (like decode or uop-cache fetch) are wider, to fill bubbles, i.e. to catch up after one decode cycle produced fewer than 4 uops, making it more likely to average close to 4 uops. - Anthropogenesis
@PeterCordes Oh, I see, thank you. I did the experiment with NOPs from 0 to 30. Here's the result. According to the theory (4 uops per cycle), it is very weird that there are points that grow continuously, like 18, 19, and 20. - Excited
@moep0: Issue/rename is 4-wide. But this experiment is about creating a bottleneck in the decoders, which are 5-wide in Skylake. Two macro-fusions can happen at the end of a decode group, e.g. 0 to 3 NOP uops and then two macro-fused test/jcc uops, for a total of 2 .. 5 uops. en.wikichip.org/wiki/intel/microarchitectures/… has a diagram. I'm not sure all the text on wikichip is 100% accurate; e.g. it says "Only one such fusion can be performed during each cycle." of the decode queue, but we've shown it can sustain 2 fusions/clock. - Anthropogenesis
@PeterCordes It seems that I got the two things mixed up. Understood, and thanks for your patient explanation. - Excited
Correct. It's of course supported in GNU C++ as well, but it is a GNU extension. gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html - Anthropogenesis
Update: decode on Skylake is apparently only 4 instructions wide; they only widened it in terms of being able to produce up to 5 uops, e.g. in a 2-1-1-1 or 4-1 pattern. Andreas Abel mentioned this in a comment, but I can't find it at the moment. - Anthropogenesis
