No, there are some instructions that can only decode at 1/clock.
This effect is Intel-only, not AMD.
Theory: the "steering" logic that sends chunks of machine code to decoders looks for patterns in the opcode byte(s) during pre-decode, and any pattern-match that might be a multi-uop instruction has to get sent to the complex decoder. To save power (and latency?), it accepts some false-positive detections of instructions as possibly being multi-uop.
The steering logic is, I think, smart enough to look at the addressing mode to distinguish `mov dword [rdi], 1` (1 uop, micro-fused) from `mov dword [rip+rel32], imm32`, which can't micro-fuse even in the decoders (because of RIP-relative and immediate) and is thus 2 uops. (TODO: test this, maybe with something that's a load + immediate like `rorx eax, [rdi], 4`, and/or with an actual multi-uop instruction mixed in.)
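A sketch of how that TODO could be tested, reusing the perf methodology from the experiment below (untested; `buf` and the unroll count are arbitrary choices of mine). The `%rep 2000` body is too many uops for the ~1.5K-uop DSB even if the loads micro-fuse, so the loop has to run from the legacy decoders:

default rel
global _start
_start:
mov ebp, 10000000
lea rdi, [buf]
align 64
.loop:
%rep 2000               ; > 1.5K uops even if micro-fused: busts the DSB,
rorx eax, [rdi], 4      ; so the load+immediate form hits the legacy decoders
%endrep
dec ebp
jnz .loop
xor edi, edi
mov eax, 231            ; __NR_exit_group
syscall
section .bss
buf: resd 1

If this sustains only 1 instruction per clock (and `idq.mite_uops` ≈ `idq.mite_cycles`), every `rorx` is going to the complex decoder, i.e. it's 2 un-fused uops or a steering false-positive; noticeably more than 1/clock would mean it micro-fuses and the simple decoders can take it.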
Every case we've seen so far has been an instruction where a very similar instruction is multi-uop, as discussed in comments. Except for `prefetch` and `popcnt`; IDK what's up with those, since `popcnt` is always a single uop on Skylake with any operand-size, register or memory source.
Andreas Abel identified the affected instructions on Haswell (https://justpaste.it/1juoc) and Skylake (https://justpaste.it/85otd). These are the Skylake cases:
- `bswap r32` (1 uop) vs. `bswap r64` (2 uops) differs only in the REX.W prefix, not in the opcode.
- `bt reg, imm` or `bt reg, reg` is 1 uop, but 2 or 10 uops for `bt` with a memory destination (crazy CISC semantics with a register index into the bitstring). Same for `bts`/`btr`/`btc`: the memory-destination form is 3 or 11 uops.
- `cdq` and `cqo` are 1 uop, but the same opcode with a `66` prefix is `cwd`, which is 2 uops on Sandybridge-family.
- `cbw` / `cwde` / `cdqe` (opcode `98h`) are all 1 uop on Skylake; perhaps they're getting lumped in with `cwd` / `cdq` / `cqo` (opcode `99h`), or this is leftover steering logic from some earlier uarch. I did confirm that it's truly a decode bottleneck on Skylake by alternating with `xor eax,eax` to break the dependency (see the sketch after this list).
- all `cmovcc` and `setcc`: some forms of `cmovcc` and `setcc` are 2 uops, since Broadwell changed to having SPAZO and CF as separate inputs to the instruction instead of needing FLAGS merging. Instead of special-casing `seta`/`cmova` and `setbe`/`cmovbe` as 2-uop instructions, all `setcc` and `cmov` instructions are steered to the complex decoder.
- `vpmovsx/zx` with a YMM destination: `vpmovzxbd ymm, xmm` is 1 uop, but `vpmovzxbd ymm, [rdi]` can never micro-fuse, so it's 2 uops in the decoders. The steering logic doesn't check for the register-source version, at least on Skylake. In a SIMD loop it will be running from the uop cache, so this isn't a problem. `vpmovzxbd xmm, xmm` isn't affected, so the steering logic does check the vector width.
- `adc reg, 0` as 1 uop is a special case of `adc reg, imm8` (2 uops) on Haswell and earlier. On Skylake the `adc al, 0` special encoding is 2 uops for no reason, even though the 3-byte encoding is 1 uop, so that's a separate missed optimization in the CPU design. IIRC, `adc reg, 0` can decode in any decoder on Skylake, since it's a different opcode than the AL special case.
- PREFETCHNTA / PREFETCHT0 / PREFETCHT1 / PREFETCHT2 - unexplained.
- `popcnt r16/32/64, r/m` - unexplained; all forms are single-uop.
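That `cdqe` test looked something like the following loop body (a reconstruction from memory, not the exact code, dropped into the same harness as the full test program below; the `%rep` count is my choice, picked to overflow the DSB so legacy decode is forced):

align 64
.loop:
%rep 1000            ; ~2000 uops total: too big for the uop cache
cdqe                 ; reads EAX, writes RAX: a 1c latency chain by itself
xor eax, eax         ; dep-breaking zeroing idiom (eliminated at rename)
%endrep
dec ebp
jnz .loop

If `cdqe` is steered to the complex decoder, each decode group is just `cdqe` + `xor` (2 instructions per cycle), so the loop runs about 2x slower from MITE than from the DSB even with the dependency broken; that's what makes it a decode bottleneck rather than a latency one.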
Not every instruction with multi-uop forms is on the list; the steering logic apparently does more detailed checks to distinguish things like `vinsertf128` and `vinsertps` with an XMM source (1 uop) from the memory-source form (2 uops). But where there are decode slowdowns, it's explainable by the pattern-matching for that opcode or group of opcodes not doing that extra checking. Except for `popcnt` and `prefetch`; perhaps they're similar to some other opcode, or that's a missed optimization in the CPU.
Experimental testing of uop cache (fast) vs. legacy decode (slow)
This proves there's a real effect, and the bottleneck is in the legacy decoders.
Andreas's comments indicate that `xor eax,eax` / `setnle al` seems to have a decode bottleneck of 1/clock. I found the same thing with `cdq`: it reads EAX and writes EDX, demonstrably runs faster from the DSB (uop cache), doesn't involve partial registers or anything weird at all, and doesn't need a dep-breaking instruction.

Even better, being a single-byte instruction it can defeat the DSB with only a short block of instructions. (This leads to misleading results from testing on some CPUs, e.g. in Agner Fog's tables and on https://uops.info/, e.g. SKX shown as 1c throughput.) https://www.uops.info/html-tp/SKX/CDQ-Measurements.html vs. https://www.uops.info/html-tp/CFL/CDQ-Measurements.html have inconsistent throughputs because of different testing methods: only the Coffee Lake test ever used a small enough unroll count (10) to not bust the DSB, finding a throughput of 0.6. (The actual throughput is 0.5 once you account for loop overhead, fully explained by back-end port pressure, same as `cqo`. IDK why you'd find 0.6 instead of 0.55 with only one extra uop for p6 in the loop.)
(Zen can run this instruction at 0.25c throughput: no weird decode problems, and it's handled by every integer-ALU port.)
`times 10 cdq` in a dec/jnz loop can run from the uop cache, and runs at 0.5c throughput on Skylake (p06), plus loop overhead which also competes for p6.
`times 20 cdq` is more than 3 uop-cache lines for one 32-byte block of machine code, meaning the loop can only run from legacy decode (with the top of the loop aligned). On Skylake this runs at 1 cycle per `cdq`. Perf counters confirm MITE delivers 1 uop per cycle, rather than groups of 3 or 4 with idle cycles between.
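The uop-cache arithmetic behind that (Skylake's DSB can cache at most 3 ways of up to 6 uops each per aligned 32-byte block of x86 code):

20 x 1-byte cdq + dec ebp / jnz (macro-fused) = 21 fused-domain uops in under 32 bytes
3 ways x 6 uops = 18-uop limit per 32-byte block  =>  can't be cached; MITE decodes every iteration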
default rel
%ifdef __YASM_VER__
CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif
global _start
_start:
mov ebp, 1000000000
align 64
.loop:
;times 10 cdq ; 0.5c throughput
;times 20 cdq ; 1c throughput, 1 MITE uop per cycle front-end
; times 10 cqo ; 0.5c throughput 2-byte insn fits uop cache
; times 10 cdqe ; 1c throughput data dependency
;times 10 cld ; ~4c throughput, 3 uops
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
On my Arch Linux desktop, I built this into a static executable to run under perf:
- i7-6700k with epp=balance_performance (max "turbo" = 3.9GHz)
- microcode revision 0xd6 (so LSD disabled, not that it matters: loops can only run from the LSD loop buffer if all their uops are in the DSB uop cache, IIRC.)
# in a bash shell:
t=cdq-latency; nasm -f elf64 "$t".asm && ld -o "$t" "$t.o" && objdump -drwC -Mintel "$t" &&
taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,frontend_retired.dsb_miss,idq.dsb_uops,idq.mite_uops,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok,idq.all_mite_cycles_4_uops ./"$t"
Disassembly:
0000000000401000 <_start>:
401000: bd 00 ca 9a 3b mov ebp,0x3b9aca00
401005: 0f 1f 84 00 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
...
40103d: 0f 1f 00 nop DWORD PTR [rax]
0000000000401040 <_start.loop>:
401040: 99 cdq
401041: 99 cdq
401042: 99 cdq
401043: 99 cdq
...
401052: 99 cdq
401053: 99 cdq # 20 total CDQ
401054: ff cd dec ebp
401056: 75 e8 jne 401040 <_start.loop>
0000000000401058 <_start.end>:
401058: 31 ff xor edi,edi
40105a: b8 e7 00 00 00 mov eax,0xe7
40105f: 0f 05 syscall
Perf results:
Performance counter stats for './cdq-latency':
5,205.44 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.000 K/sec
20,124,711,776 cycles # 3.866 GHz (49.88%)
22,015,118,295 instructions # 1.09 insn per cycle (59.91%)
21,004,212,389 uops_issued.any # 4035.049 M/sec (59.97%)
1,005,872,141 frontend_retired.dsb_miss # 193.235 M/sec (60.03%)
0 idq.dsb_uops # 0.000 K/sec (60.08%)
20,997,157,414 idq.mite_uops # 4033.694 M/sec (60.12%)
19,996,447,738 idq.mite_cycles # 3841.451 M/sec (40.03%)
59,048,559,790 idq_uops_not_delivered.core # 11343.621 M/sec (39.97%)
112,956,733 idq_uops_not_delivered.cycles_fe_was_ok # 21.700 M/sec (39.92%)
209,490 idq.all_mite_cycles_4_uops # 0.040 M/sec (39.88%)
5.206491348 seconds time elapsed
So the loop overhead (dec/jnz) happened basically for free, decoding in the same cycle as the last `cdq`. Counts are not exact because I used too many events in one run (with HT enabled), so perf did software multiplexing. From another run with fewer counters:
# same source, only these HW counters enabled to avoid multiplexing
5,161.14 msec task-clock # 1.000 CPUs utilized
20,107,065,550 cycles # 3.896 GHz
20,000,134,955 idq.mite_cycles # 3875.142 M/sec
59,050,860,720 idq_uops_not_delivered.core # 11441.447 M/sec
95,968,317 idq_uops_not_delivered.cycles_fe_was_ok # 18.594 M/sec
So we can see that MITE (legacy decode) was active basically every cycle, and that the front-end was basically never "ok" (i.e. never stalled on the back-end).
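A quick cross-check on those numbers (my arithmetic, not a counter):

20,107,065,550 cycles / 1e9 iterations  ≈  20.1 cycles per iteration
21 fused-domain uops per iteration (20 cdq + macro-fused dec/jnz)
=>  MITE sustains ~1 uop per cycle, with dec/jnz decoding in the same group as the last cdq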
With only 10 CDQ instructions, allowing the DSB to work:
...
0000000000401040 <_start.loop>:
401040: 99 cdq
401041: 99 cdq
...
401049: 99 cdq # 10 total CDQ insns
40104a: ff cd dec ebp
40104c: 75 f2 jne 401040 <_start.loop>
Performance counter stats for './cdq-latency' (4 runs):
1,417.38 msec task-clock # 1.000 CPUs utilized ( +- 0.03% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.001 K/sec
5,511,283,047 cycles # 3.888 GHz ( +- 0.03% ) (49.83%)
11,997,247,694 instructions # 2.18 insn per cycle ( +- 0.00% ) (59.99%)
10,999,182,841 uops_issued.any # 7760.224 M/sec ( +- 0.00% ) (60.17%)
197,753 frontend_retired.dsb_miss # 0.140 M/sec ( +- 13.62% ) (60.21%)
10,988,958,908 idq.dsb_uops # 7753.010 M/sec ( +- 0.03% ) (60.21%)
10,234,859 idq.mite_uops # 7.221 M/sec ( +- 27.43% ) (60.21%)
8,114,909 idq.mite_cycles # 5.725 M/sec ( +- 26.11% ) (39.83%)
40,588,332 idq_uops_not_delivered.core # 28.636 M/sec ( +- 21.83% ) (39.79%)
5,502,581,002 idq_uops_not_delivered.cycles_fe_was_ok # 3882.221 M/sec ( +- 0.01% ) (39.79%)
56,223 idq.all_mite_cycles_4_uops # 0.040 M/sec ( +- 3.32% ) (39.79%)
1.417599 +- 0.000489 seconds time elapsed ( +- 0.03% )
As reported by `idq_uops_not_delivered.cycles_fe_was_ok`, basically all the unused front-end uop slots were the fault of the back-end (port pressure on p0/p6), not the front-end.
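The port math behind that claim (again my arithmetic):

5,511,283,047 cycles / 1e9 iterations  ≈  5.5 cycles per iteration
10 cdq uops (p0/p6) + 1 macro-fused dec/jnz uop (p6) = 11 uops for 2 ports
=>  11 / 2 = 5.5 cycles per iteration: purely back-end port-limited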
Comments:

- Does `xor eax,eax` run as expected? Does padding it with a dummy REP or DS prefix instead of REX.W still slow it down when not coming from the DSB? – Antung
- `xor eax, eax; setnle al` has the same behavior as `xor rax, rax; setnle al`. – Triviality
- With `xor rbx, rbx; setnle bl; movq2dq xmm0, mm0`, the throughput becomes 2 (vs. 1 in the DSB case). – Triviality