Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide, section "Warm-up period for YMM and ZMM vector instructions", which says:
The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs.
I did observe the slowdown, although on my machine it was closer to ~2x rather than 4.5x. But what I've found is that on my CPU (Intel i7-9750H, Coffee Lake) the slowdown affects not only 256-bit operations, but also 128-bit vector ops and scalar floating-point ops (and even some number of GPR-only instructions immediately following an XMM-touching instruction).
The code of the benchmark program:
# Compile and run:
# clang++ ymm-throttle.S && ./a.out
.intel_syntax noprefix
.data
L_F0:
.asciz "ref cycles = %u\n"
.p2align 5
L_C0:
.long 1
.long 2
.long 3
.long 4
.long 1
.long 2
.long 3
.long 4
.text
.set initial_scalar_warmup, 5*1000*1000
.set iteration_count, 30*1000
.set wait_count, 50*1000
.global _main
_main:
# ---------- Initial warm-up
# It seems that we enter _main (at least on macOS 11.2.2) in a "ymm warmed-up" state.
#
# The initial warm-up loop below is long enough for the processor to switch back to
# the "ymm cold" state. It may also reduce measurement noise caused by dynamic
# frequency scaling (hopefully the CPU is at full boost by the time the initial
# warm-up loop finishes).
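# Clear any dirty upper-YMM state left over from program startup.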
vzeroupper
push rbp
mov ecx, initial_scalar_warmup
.p2align 4
_initial_loop:
add eax, 1
add edi, 1
add edx, 1
dec ecx
jnz _initial_loop
# --------- Measure XMM
# TOUCH YMM.
# Test to see the effect of touching an unrelated YMM register
# on XMM performance.
# If the "vpxor ymm9" below is commented out, the xmm_loop below
# runs considerably (~2x) faster.
vpxor ymm9, ymm9, ymm9
mov ecx, iteration_count
rdtsc
mov esi, eax
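# Zero the XMM registers used below and load a vector constant for the timed loop.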
vpxor xmm0, xmm0, xmm0
vpxor xmm1, xmm1, xmm1
vpxor xmm2, xmm2, xmm2
vmovdqa xmm3, [rip + L_C0]
.p2align 5
_xmm_loop:
# Here we only execute a 128-bit (XMM) VEX-encoded op, but it still triggers execution throttling.
vpaddd xmm0, xmm3, xmm3
add edi, 1
add eax, 1
dec ecx
jnz _xmm_loop
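# Make sure the loop has finished executing before reading the end timestamp.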
lfence
rdtsc
sub eax, esi
mov esi, eax # ESI = ref cycles count
# ------------- Print results
lea rdi, [rip + L_F0]
xor eax, eax
call _printf
vzeroupper
xor eax, eax
pop rbp
ret
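Aside: the wait_count constant defined at the top is not used in the listing above; it hints at a natural follow-up experiment. The sketch below (hypothetical, untested as written; the _wait_loop label is mine) inserts a GPR-only delay loop between the YMM touch and the timed XMM loop. Assuming the loop retires roughly one iteration per cycle, ~50,000 iterations should cover most of the ~56,000-cycle transition period, so the timed _xmm_loop should run at close to full speed again:
vpxor ymm9, ymm9, ymm9 # trigger the transition to AVX-warm
mov ecx, wait_count
_wait_loop:
add r8d, 1 # GPR-only work; should execute normally during the transition
dec ecx
jnz _wait_loop # roughly 1 iteration/cycle => ~50,000 cycles of delay
# ... then the rdtsc + _xmm_loop measurement as above; if the transition
# has completed, the throttling should be gone (bump wait_count if not,
# since the transition is counted in core cycles, not reference cycles).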
Question: Is my benchmark correct? Does the description (below) of what's happening seem plausible?
The CPU is in the AVX-cold state (no 256-bit/512-bit instruction has been executed for ~675 µs) and encounters a single instruction with a YMM (or ZMM) destination register. The CPU immediately switches to some sort of "transition to AVX-warm" state. This switch presumably takes the ~100-200 cycles mentioned in Agner's guide, and the "transition" period itself lasts ~56,000 cycles.
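(Sanity check on the numbers: assuming the core is boosting at around 4.0 GHz, 56,000 cycles / 4.0 GHz ≈ 14 µs, which matches the 14 μs figure in the quote from Agner's guide above.)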
During the transition period, GPR code may execute normally, but any instruction that has a vector destination register (including 128-bit XMM or scalar floating-point instructions, even vmovq xmm0, rax) applies throttling to the entire execution pipeline. This affects GPR-only code immediately following such an instruction for some number of cycles (not sure how many; perhaps a dozen cycles' worth of instructions).
Perhaps the throttling limits the number of µops dispatched to the execution units (regardless of what those µops are, as long as at least one in-flight µop has a vector destination register)?
What's new here for me is that I thought throttling during the transition period would apply only to 256-bit (and 512-bit) instructions, but it seems that any instruction with a vector register destination is affected (as well as roughly 20-60 GPR-only instructions immediately following it; I can't measure this more precisely on my system).
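For reference, here is a sketch (hypothetical, not code I have run in this form) of how that GPR tail could be probed: time a fixed block of GPR-only adds placed right after a single vector-destination instruction while the CPU is in the transition period, then compare against the same block with the vmovq commented out; sweeping the .rept count should bracket the ~20-60 instruction estimate:
vpxor ymm9, ymm9, ymm9 # put the CPU into the transition period
lfence
rdtsc
mov esi, eax # ESI = start reference cycles
vmovq xmm0, rax # the single vector-destination instruction under test
.rept 64 # sweep this count to see how far the effect extends
add edi, 1 # GPR-only work
.endr
lfence
rdtsc
sub eax, esi # EAX = elapsed reference cycles for the block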
Related: "Voltage Only Transitions" section of an article at Travis Downs blog may be describing the same effect. Although the author measured performance of YMM vectors during transition period, the conclusion was that it is not the upper part of the vector that's being split, rather throttling applied to entire pipeline when vector register touching instruction is encountered during transition period. (edit: the blog post did not measure XMM registers during transition period, which is what this post is measuring).