Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?
The front end of recent Intel CPUs contains one complex decoder and a number of simple decoders. The complex decoder can handle instructions that decode to multiple µops, whereas the simple decoders support only instructions that decode to a single (fused-domain) µop.

Can all 1-µop instructions be decoded by the simple decoders, or are there 1-µop instructions that can only be handled by the complex decoder?

Triviality answered 24/5, 2020 at 0:15 Comment(5)
I think I might have read something about an instruction that surprisingly couldn't decode in a simple decoder, but I don't think it was for SnB-family CPUs; maybe a low-power uarch. (Intel decoders hold back macro-fusable instructions until the next group in case there's a jcc, but I don't mean that). Is there any hint / evidence that simple decoders might not handle every single-uop insn that we could investigate further?Antung
"xor rax, rax; setnle al" has a throughput of 1 if it goes through the decoders; if it comes from the DSB, the throughput is, as expected, 0.5 cycles. This seems to suggest that setnle might only be able to use the complex decoder. Or is there some other bottleneck in the first case that I'm missing?Triviality
Interesting; does xor eax,eax run as expected? Does padding it with a dummy REP or DS instead of REX.W prefix still slow it down when not coming from the DSB?Antung
xor eax, eax; setnle al has the same behavior as xor rax, rax; setnle al.Triviality
Also, if I add another instruction that requires the complex decoder, such as xor rbx, rbx; setnle bl; movq2dq xmm0, mm0 the throughput becomes 2 (vs. 1 in the DSB case).Triviality
No, there are some instructions that can only decode 1/clock

This effect is Intel-only, not AMD.

Theory: the "steering" logic that sends chunks of machine code to decoders looks for patterns in the opcode byte(s) during pre-decode, and any pattern-match that might be a multi-uop instruction has to get sent to the complex decoder. To save power (and latency?) it accepts some false-positive detections of instructions as being possibly multi-uop.

The steering logic is I think smart enough to look at the addressing mode to distinguish mov dword [rdi], 1 (1 uop micro-fused) from mov dword [rip+rel32], imm32 which can't micro-fuse even in the decoders (because of RIP-relative and immediate) and thus is 2 uops. (TODO: test this, maybe with something that's a load + immediate like rorx eax, [rdi], 4, and/or with an actual multi-uop instruction mixed in.)
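Here's a sketch of what that test could look like (untested; it assumes the same NASM + perf setup as the cdq experiment below, and borrows the %rep trick from the comments at the bottom to overflow the DSB so the loop can only run from legacy decode; buf and the iteration counts are arbitrary). Note that stores are back-end-limited to 1/clock anyway, so raw IPC alone may not settle it for the mov forms; the idq.mite_* and idq_uops_not_delivered.cycles_fe_was_ok counters used below should.

; Untested sketch for the TODO above. %rep 2000 makes the loop body far
; bigger than the ~1.5K-uop DSB, so it can only run from the legacy decoders.
; Build and profile like the cdq test below, watching idq.mite_uops,
; idq.all_mite_cycles_4_uops and idq_uops_not_delivered.cycles_fe_was_ok.
default rel
section .bss
buf:    resd 1

section .text
global _start
_start:
    mov  ebp, 1000000
    lea  rdi, [buf]
align 64
.loop:
%rep 2000
    mov  dword [rdi], 1         ; 1 micro-fused uop: does it get a simple decoder?
    ;mov dword [buf], 1         ; [rip+rel32] + imm32: can't micro-fuse, 2 uops
    ;rorx eax, [rdi], 4         ; load + immediate, per the TODO above
%endrep
    dec  ebp
    jnz  .loop

    xor  edi, edi
    mov  eax, 231               ; __NR_exit_group
    syscall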

Every case we've seen so far has been an instruction where a very similar instruction is multi-uop, as discussed in comments. Except for prefetch and popcnt; IDK what's up with that, since popcnt is always single uop on Skylake with any operand-size, register or memory source.

Andreas Abel identified the affected instructions on Haswell (https://justpaste.it/1juoc) and Skylake (https://justpaste.it/85otd). These are the Skylake cases:

  • bswap r32 (1 uop) vs. bswap r64 (2 uops) differs only in the REX.W prefix, not in the opcode.

  • bt reg, imm or bt reg, reg is 1 uop, but bt with a memory operand is 2 uops (immediate bit-index) or 10 (register bit-index: crazy CISC semantics where the register can index outside the addressed dword of the bitstring). Same for bts/btr/btc, whose memory-destination forms are 3 or 11 uops.

  • cdq and cqo are 1 uop, but the same opcode with a 66 prefix is cwd, 2 uops on Sandybridge-family.

  • cbw / cwde / cdqe (opcode 98h) are all 1 uop on Skylake; perhaps they're getting lumped in with cwd / cdq / cqo (opcode 99h), or this is leftover steering logic from some earlier uarch. I did confirm that it's truly a decode bottleneck on Skylake by alternating with xor eax,eax to break the dependency.

  • all cmovcc and setcc: some forms of cmovcc and setcc (the a/be predicates, which read both CF and ZF) are 2 uops, since Broadwell changed to having SPAZO and CF as separate inputs to the instruction instead of needing FLAGS merging. Instead of special-casing seta/cmova and setbe/cmovbe as 2-uop instructions, all setcc and cmovcc instructions are steered to the complex decoder.

  • vpmovsx/zx with a YMM destination: vpmovzxbd ymm, xmm is 1 uop, but vpmovzxbd ymm, [rdi] can never micro-fuse so it's 2 uops in the decoders. The steering logic doesn't check for the register source version, at least in Skylake. In a SIMD loop, it will be running from the uop cache so this isn't a problem. vpmovzxbd xmm, xmm isn't affected, so the steering logic does check the vector width.

  • adc reg, 0 as 1 uop is a special case of adc reg, imm8 (2 uops) on Haswell and earlier. On Skylake the adc al, 0 special encoding is 2 uops for no reason, even though the 3-byte encoding is 1 uop, so that's a separate missed-optimization in the CPU design. IIRC, adc reg, 0 can decode in any decoder on Skylake, since it's a different opcode than the AL special case.

  • PREFETCHNTA / PREFETCHT0 / PREFETCHT1 / PREFETCHT2 - unexplained

  • popcnt r16/32/64, r/m - unexplained, all forms are single-uop.

Not every instruction with multi-uop forms is on the list; the steering logic apparently does more detailed checks to distinguish things like vinsertf128 / vinsertps with an xmm source (1 uop) from the memory-source forms (2 uops). But where there are decode slowdowns, it's explainable by the pattern-matching for that opcode or group of opcodes not doing that extra checking. Except for popcnt and prefetch; perhaps they're similar to some other opcode, or that's a missed optimization in the CPU.
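For the unexplained popcnt case, a follow-up test might look like this (untested sketch; drop it in as the loop body of the test program below, in place of the times 20 cdq). Since popcnt is p1-only in the back-end, ~1 pair per clock is the expected throughput either way, so as in the experiment below you'd look at idq_uops_not_delivered.cycles_fe_was_ok to see whose fault the unused slots are: near zero for a decode bottleneck, large if it's just port pressure.

%rep 1000                ; ~2000 uops: too big for the DSB, forces legacy decode
    xor    eax, eax      ; dep-breaking: popcnt has a false output dependency on SKL
    popcnt eax, ecx      ; 1 uop; does it still only come from the complex decoder?
%endrep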


Experimental testing of uop cache (fast) vs. legacy decode (slow)

This proves there's a real effect, and the bottleneck is in the legacy decoders.

Andreas's comments indicate that xor eax,eax / setnle al seems to have a decode bottleneck of 1/clock. I found the same thing with cdq: it reads EAX and writes EDX, demonstrably runs faster from the DSB (uop cache), doesn't involve partial registers or anything weird at all, and doesn't need a dep-breaking instruction.

Even better, being a single-byte instruction it can defeat the DSB with only a short block of instructions. (Leading to misleading results from testing on some CPUs, e.g. in Agner Fog's tables and on https://uops.info/, e.g. SKX shown as 1c throughput.) https://www.uops.info/html-tp/SKX/CDQ-Measurements.html vs. https://www.uops.info/html-tp/CFL/CDQ-Measurements.html have inconsistent throughputs because of different testing methods: only the Coffee Lake test ever used a small enough unroll count (10) to not bust the DSB, finding a throughput of 0.6. (The actual throughput is 0.5 once you account for loop overhead, fully explained by back-end port pressure, same as cqo. IDK why you'd find 0.6 instead of 0.55 with only one extra uop for p6 in the loop.)
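(Where 0.55 comes from: 10 cdq uops for p0/p6, plus the macro-fused dec/jnz which also needs p6, is 11 uops for two ports, so at best 5.5 cycles per iteration, i.e. 0.55c per cdq.)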

(Zen can run this instruction at 0.25c throughput; no weird decode problems, and it's handled by every integer-ALU port.)


times 10 cdq in a dec/jnz loop can run from the uop cache, and runs at 0.5c throughput on Skylake (p06), plus loop overhead which also competes for p6.

times 20 cdq is more than 3 uop cache lines for one 32-byte block of machine code, meaning the loop can only run from legacy decode (with the top of the loop aligned). On Skylake this runs at 1 cycle per cdq. Perf counters confirm MITE delivers 1 uop per cycle, rather than groups of 3 or 4 with idle cycles between.
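(Capacity arithmetic: SnB-family's uop cache can keep at most 3 lines of up to 6 uops each for one 32-byte block of code, i.e. 18 uops, and all 20 single-byte cdq instructions sit in one 32-byte block, so they can never be cached even before counting dec/jnz.)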

default rel
%ifdef __YASM_VER__
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:
    mov  ebp, 1000000000

align 64
.loop:
    ;times 10 cdq   ; 0.5c throughput
    ;times 20 cdq   ; 1c throughput, 1 MITE uop per cycle front-end

    ; times 10 cqo        ; 0.5c throughput 2-byte insn fits uop cache
    ; times 10 cdqe       ; 1c throughput: serial data dependency (reads EAX, writes RAX)
    ;times 10 cld         ; ~4c throughput, 3 uops

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

On my Arch Linux desktop, I built this into a static executable to run under perf:

  • i7-6700k with epp=balance_performance (max "turbo" = 3.9GHz)
  • microcode revision 0xd6 (so LSD disabled, not that it matters: loops can only run from the LSD loop buffer if all their uops are in the DSB uop cache, IIRC.)
# in a bash shell:
t=cdq-latency; nasm -f elf64 "$t".asm && ld -o "$t" "$t.o" && objdump -drwC -Mintel "$t" && 
  taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,frontend_retired.dsb_miss,idq.dsb_uops,idq.mite_uops,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok,idq.all_mite_cycles_4_uops ./"$t"

Disassembly:

0000000000401000 <_start>:
  401000:       bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
  401005:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]
...
  40103d:       0f 1f 00                nop    DWORD PTR [rax]

0000000000401040 <_start.loop>:
  401040:       99                      cdq    
  401041:       99                      cdq    
  401042:       99                      cdq    
  401043:       99                      cdq    
...
  401052:       99                      cdq    
  401053:       99                      cdq             # 20 total CDQ
  401054:       ff cd                   dec    ebp
  401056:       75 e8                   jne    401040 <_start.loop>

0000000000401058 <_start.end>:
  401058:       31 ff                   xor    edi,edi
  40105a:       b8 e7 00 00 00          mov    eax,0xe7
  40105f:       0f 05                   syscall 

Perf results:

 Performance counter stats for './cdq-latency':

          5,205.44 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 1      page-faults               #    0.000 K/sec                  
    20,124,711,776      cycles                    #    3.866 GHz                      (49.88%)
    22,015,118,295      instructions              #    1.09  insn per cycle           (59.91%)
    21,004,212,389      uops_issued.any           # 4035.049 M/sec                    (59.97%)
     1,005,872,141      frontend_retired.dsb_miss #  193.235 M/sec                    (60.03%)
                 0      idq.dsb_uops              #    0.000 K/sec                    (60.08%)
    20,997,157,414      idq.mite_uops             # 4033.694 M/sec                    (60.12%)
    19,996,447,738      idq.mite_cycles           # 3841.451 M/sec                    (40.03%)
    59,048,559,790      idq_uops_not_delivered.core # 11343.621 M/sec                   (39.97%)
       112,956,733      idq_uops_not_delivered.cycles_fe_was_ok #   21.700 M/sec                    (39.92%)
           209,490      idq.all_mite_cycles_4_uops #    0.040 M/sec                    (39.88%)

       5.206491348 seconds time elapsed

So the loop overhead (dec/jnz) happened basically for free, decoding in the same cycle as the last cdq. Counts are not exact because I used too many events in one run (with HT enabled), so perf did software multiplexing. From another run with fewer counters:

# same source, only these HW counters enabled to avoid multiplexing
          5,161.14 msec task-clock                #    1.000 CPUs utilized          

    20,107,065,550      cycles                    #    3.896 GHz                    
    20,000,134,955      idq.mite_cycles           # 3875.142 M/sec                  
    59,050,860,720      idq_uops_not_delivered.core # 11441.447 M/sec                 
        95,968,317      idq_uops_not_delivered.cycles_fe_was_ok #   18.594 M/sec                  

So we can see that MITE (legacy decode) was active basically every cycle, and that the front-end was basically never "ok": the unused issue slots were the front-end's fault, not back-end stalls.


With only 10 CDQ instructions, allowing the DSB to work:

...
0000000000401040 <_start.loop>:
  401040:       99                      cdq    
  401041:       99                      cdq    
...
  401049:       99                      cdq        # 10 total CDQ insns
  40104a:       ff cd                   dec    ebp
  40104c:       75 f2                   jne    401040 <_start.loop>

 Performance counter stats for './cdq-latency' (4 runs):

          1,417.38 msec task-clock                #    1.000 CPUs utilized            ( +-  0.03% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 1      page-faults               #    0.001 K/sec                  
     5,511,283,047      cycles                    #    3.888 GHz                      ( +-  0.03% )  (49.83%)
    11,997,247,694      instructions              #    2.18  insn per cycle           ( +-  0.00% )  (59.99%)
    10,999,182,841      uops_issued.any           # 7760.224 M/sec                    ( +-  0.00% )  (60.17%)
           197,753      frontend_retired.dsb_miss #    0.140 M/sec                    ( +- 13.62% )  (60.21%)
    10,988,958,908      idq.dsb_uops              # 7753.010 M/sec                    ( +-  0.03% )  (60.21%)
        10,234,859      idq.mite_uops             #    7.221 M/sec                    ( +- 27.43% )  (60.21%)
         8,114,909      idq.mite_cycles           #    5.725 M/sec                    ( +- 26.11% )  (39.83%)
        40,588,332      idq_uops_not_delivered.core #   28.636 M/sec                    ( +- 21.83% )  (39.79%)
     5,502,581,002      idq_uops_not_delivered.cycles_fe_was_ok # 3882.221 M/sec                    ( +-  0.01% )  (39.79%)
            56,223      idq.all_mite_cycles_4_uops #    0.040 M/sec                    ( +-  3.32% )  (39.79%)

          1.417599 +- 0.000489 seconds time elapsed  ( +-  0.03% )

As reported by idq_uops_not_delivered.cycles_fe_was_ok, basically all the unused front-end uop slots were the fault of the back-end (port pressure on p0 / p6), not the front-end.

Antung answered 23/9, 2020 at 21:49 Comment(13)
Very interesting. I wonder if there is some pattern to these instructions, e.g. maybe they look similar (in opcode or otherwise) to instructions that do take multiple uops? Presumably the problem is a heuristic in the steering logic that steers these to the complex decoder. An alternate explanation is that they do have to go to the complex decoder since there is something more complicated about them, but that seems less likely.Freeze
@BeeOnRope: Keeping the steering logic simple (and low latency?) sounds like a good guess. That makes more sense than wanting to keep the simple decoders even simpler by not replicating the logic to decode cdq. setcc is relatively weird in terms of what it does (reading only flags, writing a register, although of course it actually RMWs a register since Intel doesn't rename low-8 regs anymore), but I would have thought that was only for the back-end; in the front-end it's a normal 2-byte opcode + modrm.Antung
@BeeOnRope: If you would like to investigate this further, here is a list of 1-uop instructions that seem to require the complex decoder on Skylake: justpaste.it/85otd and here is one for Haswell: justpaste.it/1juocTriviality
@AndreasAbel: The Haswell 1c-adc imm8 instructions are just the imm8=0 special case, right? (BDW was the first SnB-fam CPU to make adc 1 uop in the normal case). Makes some sense that only the complex decoder would have logic to decode ADC at all, since it's normally 2 uops. SKL removed it from the list. So that's probably the least surprising.Antung
@AndreasAbel: The presence of YMM-destination VPMOVZX/SX* on the list makes me think of the fact that it can't micro-fuse a memory operand at all, even if it's not an indexed addressing mode. The XMM versions can, but the YMM versions can't. But with a register source it is only 1 uop. As for bswap r32, the same opcode is 2 uops with 64-bit operand size. bt* are potentially weird with a memory destination so that makes some sense. Yeah, very interesting, there might be some plausible explanation for some groups of such instructions.Antung
@PeterCordes Yes, it's the imm8=0 special case.Triviality
For setcc and cmovcc the behavior would be explained by the fact that some variants of the instruction need two uops (ones like cmovbe which read from both SPAZO and C flag groups). The predecoder steers based only on the opcode, and then the decoder sorts out how many uops are needed? Same for VPMOVSX* because of the lack of fusion in that one case.Freeze
@BeeOnRope: Oh, I hadn't realized setbe and seta were also 2 uops on Skylake :/ They "only" have 3 inputs (old register value, and 2 flags). Maybe they're just unchanged from Sandybridge / Haswell, which had the exact same effect for those predicates, with others being 1 uop. The fact that AMD didn't turn setcc into setcc r/m32 for 64-bit mode is one of my biggest pet peeves. Would have been a tiny change. (Of course, the CPU still has to decode 32-bit machine code where it has to be r/m8 so we'd probably have ended up with it limited to the complex decoder.)Antung
@PeterCordes doesn't this also happen (goes to the complex decoder) when an instruction has more than one prefix, or when the instruction is longer than a certain number of bytes? It used to be 8 bytes.Carlin
@LewisKelsey: Not that I'm aware of; did you test it? Inside a big %rep 2000 (NASM), I put instructions like 67 67 67 67 67 67 67 48 01 d0 (times 7 db 0x67 / add rax,rdx) with different destination registers. I got about 1.59 IPC average, matching the expected bottleneck on instruction-fetch bandwidth of 16B/clock / (10B / insn) = 1.6 insn/clock. (And perf counters confirm 0 idq.dsb_uops, all from MITE). Confirming that my i7-6700k Skylake was able to decode better than 1/clock on an insn stream that consists of 10-byte insns, each with 8 prefixes. godbolt.org/z/fof7b63K6Antung
@PeterCordes that is good to know. I'm pretty sure before the IQ was introduced, and when you had an IFETCH block in the steering buffer, only so many bytes were steered to each decoder and you needed multiple cycles to produce a multiple-prefix vector -- at least that's what the patents around the P6 / NetBurst time were showing.Carlin
@PeterCordes tested 67 67 67 67 67 67 67 67 67 67 48 01 d0 on KBL, repeated 10000x: MITE cycles is 8000, drops to 5000 with 9x 67 and 4000 with 8x; steady 2500 with 1-2 prefixes. With 3-7 prefixes it goes up and down between 2500 and 4000. No DSB uops and no hw interrupts.Carlin
A couple weeks ago I tried to port the test to check long instructions on Core 2 and PIII. I forget what I got on Core 2, but I couldn't get anything fast on PIII (there's always a penalty for more than 1 prefix), although there were some differences in slowdown depending on the prefix. Maybe alternating short/long could confirm the effect on PIII, if I can cook up a 9-byte insn that only has one prefix (not counting 0F). Hmm, mov [modrm], imm32 should do it, except that's 2 uops and stores are 1/clock throughput. :/Antung
