First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops

Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide section "Warm-up period for YMM and ZMM vector instructions", where it says:

The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs.

I got the slowdown, although it seems like it was closer to ~2x instead of 4.5x. But what I've found is that on my CPU (Intel i7-9750H, Coffee Lake) the slowdown affects not only 256-bit operations, but also 128-bit vector ops and scalar floating point ops (and even some number of GPR-only instructions following an XMM-touching instruction).

Code of the benchmark program:

# Compile and run:
# clang++ ymm-throttle.S && ./a.out

.intel_syntax noprefix

.data
L_F0:
  .asciz "ref cycles = %u\n"

.p2align 5
L_C0:
  .long 1
  .long 2
  .long 3
  .long 4
  .long 1
  .long 2
  .long 3
  .long 4

.text

.set initial_scalar_warmup, 5*1000*1000
.set iteration_count, 30*1000
.set wait_count, 50*1000

.global _main
_main:
  # ---------- Initial warm-up
  # It seems that we enter _main (at least on macOS 11.2.2) in a "ymm warmed-up" state.
  #
  # The initial warm-up loop below is long enough for the processor to switch back to
  # the "ymm cold" state. It may also reduce dynamic-frequency-scaling-related measurement
  # deviations (hopefully the CPU is at full boost by the time we finish the initial warm-up loop).

  vzeroupper

  push rbp
  mov ecx, initial_scalar_warmup

.p2align 4
_initial_loop:
  add eax, 1
  add edi, 1
  add edx, 1

  dec ecx
  jnz _initial_loop

  # --------- Measure XMM

  # TOUCH YMM.
  # Test to see the effect of touching an unrelated YMM register
  # on XMM performance.
  # If the "vpxor ymm9" below is commented out, then the xmm_loop below
  # runs a lot faster (~2x faster).
  vpxor ymm9, ymm9, ymm9

  mov ecx, iteration_count
  rdtsc
  mov esi, eax

  vpxor xmm0, xmm0, xmm0
  vpxor xmm1, xmm1, xmm1
  vpxor xmm2, xmm2, xmm2
  vmovdqa xmm3, [rip + L_C0]

.p2align 5
_xmm_loop:
  # Here we only do a 128-bit (XMM) VEX-encoded op, but it still triggers execution throttling.
  vpaddd xmm0, xmm3, xmm3
  add edi, 1
  add eax, 1

  dec ecx
  jnz _xmm_loop

  lfence
  rdtsc
  sub eax, esi
  mov esi, eax  # ESI = ref cycles count

  # ------------- Print results

  lea rdi, [rip + L_F0]
  xor eax, eax
  call _printf

  vzeroupper
  xor eax, eax
  pop rbp
  ret

Question: Is my benchmark correct? Does the description (below) of what's happening seem plausible?

The CPU is in an AVX-cold state (no 256-bit/512-bit instruction has been executed for ~675 µs) and encounters a single instruction with a YMM (or ZMM) destination register. The CPU immediately switches to some sort of "transition to AVX-warm" state. This switch presumably takes the ~100-200 cycles mentioned in Agner's guide, and the "transition" period lasts ~56,000 cycles.

During the transition period GPR code may execute normally, but any instruction that has a vector destination register (including 128-bit XMM or scalar floating point instructions, even vmovq xmm0, rax) applies throttling to the entire execution pipeline. This affects GPR-only code immediately following such an instruction for some number of cycles (not sure how many; maybe a dozen cycles' worth of instructions).

Perhaps throttling limits the number of µops dispatched to the execution units (regardless of what those µops are, as long as there is at least one µop with a vector destination register)?
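
For reference, a hypothetical variant of the loop above that should expose the effect on GPR-only code following a single vector-touching instruction (this is only a sketch, not part of the gist; the label name is made up, and it assumes the same rdtsc measurement skeleton as _xmm_loop):

.p2align 5
_gpr_only_loop:
  # One vector-destination instruction per iteration; everything else is GPR-only.
  vmovq xmm0, rax
  add edi, 1
  add r8d, 1
  add r9d, 1

  dec ecx
  jnz _gpr_only_loop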


What's new here for me is that I thought that during the transition period throttling would be applied only to 256-bit (and 512-bit) instructions, but it seems like any instruction that has a vector register destination is affected (as well as roughly the next 20-60 GPR-only instructions; I can't measure more precisely on my system).


Related: "Voltage Only Transitions" section of an article at Travis Downs blog may be describing the same effect. Although the author measured performance of YMM vectors during transition period, the conclusion was that it is not the upper part of the vector that's being split, rather throttling applied to entire pipeline when vector register touching instruction is encountered during transition period. (edit: the blog post did not measure XMM registers during transition period, which is what this post is measuring).

Triviality answered 30/3, 2021 at 15:43 Comment(12)
Could this be Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says? Your title mentions "SSE" scalar ops, but I still only see AVX encodings like vaddss, not addss. If you did have any legacy SSE, then Why is this SSE code 6 times slower without VZEROUPPER on Skylake? might be relevant.Mastoiditis
Also, it's hard to tell exactly what code should be uncommented. I'd suggest uncommenting all the instructions for a minimal reproducible example that demonstrates a big surprising slowdown.Mastoiditis
@PeterCordes by "SSE ops" I meant scalar floating point ops, VEX-encoded. I assume no SSE/AVX transitions would take place here (is that assumption correct?). Sorry about the messy example; I updated the gist so that it's runnable and has a minimal amount of code. It should show the slowdown/throttling without needing to uncomment lines now.Triviality
Yes, no transition penalties if you actually use scalar AVX, not scalar SSE. You should fix your title, that's important.Mastoiditis
And no, off-site links aren't sufficient for Stack Overflow questions. Include a minimal reproducible example in the question itself. (It's fine to have an off-site link for more detail and more variations, but at least the essential part of the code, maybe minus some boilerplate, should be in the question. Like you have here but with the important bits uncommented for one variation.)Mastoiditis
@PeterCordes done, I just included the body of the gist instead of the pseudo-code now. Let me know if it's clear enough.Triviality
The voltage-only transition where IPC is 1/4 of normal that Travis Downs measured was for YMM, not ZMM. That seems to be consistent with your results.Francisco
@Francisco you're right, I fixed the post to mention YMM. I already knew YMM would execute slower based on Agner's guide. What surprised me and what is measured in this post is that XMM is causing throttling too (during the transition period). In Travis's post he mentions a "wide" instruction as a cause for throttling; I assumed YMM or ZMM was implied.Triviality
@Francisco in the "Summary" it says "After a period of about 680 μs not using the AVX upper bits (255:128) or AVX-512 upper bits (511:256) the processor enters a mode where using those bits again requires at least a voltage transition, and sometimes a frequency transition", so it mentions that touching the upper bits causes the throttling. But it seems like perhaps it's not the upper bits, but simply having a vector register as an operand plus the global dirty-upper-state CPU flag, so it affects 128-bit AVX and scalar as well.Triviality
All instructions are affected by the reduced IPC during the voltage transition. Travis even showed that add on GP registers was affected. It's just that instructions on xmm registers will not cause the voltage transition at all.Francisco
@Francisco "All instructions are affected by the reduced IPC during the voltage transition" I think you might be missing an important detail here. Not all instructions are affected by voltage transition itself. GPR code will run normally during transition. But if vector touching instruction is seen during transition and CPU is in Dirty Upper State, then it starts throttling for a period of N cycles. And throttling affects all instructions yes. I am just surprised that even 128-bit AVX instructions were causing the throttling too, does not seem like it's mentioned.Triviality
Right: throttling occurs if any "wide" instruction is in the instruction window, and while throttling is occurring it affects all instructions (SIMD or otherwise). The confusing part is that "wide" instructions include narrow SIMD in most cases (see answer).Armitage

The fact that you see throttling even for narrow SIMD instructions is a side-effect of a behavior I call implicit widening.

Basically, on modern Intel, if the upper bits (128 to 255) are dirty on any register in the range ymm0 to ymm15, any SIMD instruction is internally widened to 256 bits, since the upper bits need to be zeroed and this requires the full 256-bit registers in the register file to be powered, and probably the 256-bit ALU path as well. So the instruction acts, for the purposes of AVX frequencies, as if it were 256 bits wide.

Similarly, if bits 256 to 511 are dirty on any zmm register in the range zmm0 to zmm15, operations are implicitly widened to 512 bits.

For the purposes of light vs heavy instructions, the widened instructions have the same type as they would if they were full width. That is, a 128-bit FMA which gets widened to 512 bits acts as "heavy AVX-512" even though only 128 bits of FMA is occurring.

This applies to all instructions which use the xmm/ymm registers, even scalar FP operations.
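
As an illustration of that classification, here is a hypothetical sketch for an AVX-512-capable CPU (which the i7-9750H in the question is not); the label name is made up, and the frequency-license effect would have to be observed externally, e.g. with perf or turbostat:

  vpaddd zmm1, zmm1, zmm1        # 512-bit write: dirties bits 256..511 of zmm1 (zmm0-zmm15 range)

.p2align 5
_narrow_fma_loop:
  # Only 128 bits of FMA work per iteration, but with dirty zmm uppers this
  # should count as "heavy AVX-512" for frequency-license purposes.
  vfmadd231ps xmm0, xmm2, xmm3
  dec ecx
  jnz _narrow_fma_loop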

Note that this doesn't just apply to this throttling period: it means that if you have dirty uppers, a narrow SIMD instruction (or scalar FP) will cause a transition to the more conservative DVFS states just as a full-width instruction would do.
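
A sketch of the practical consequence, in terms of the benchmark from the question (hypothetical; the label name is made up): clearing the dirty uppers with vzeroupper before the 128-bit loop should avoid the implicit widening:

  vpxor ymm9, ymm9, ymm9   # 256-bit write: this is what made the xmm loop slow in the question
  vzeroupper               # clears the upper halves; per the explanation above, the
                           # following XMM ops should no longer be implicitly widened

.p2align 5
_clean_xmm_loop:
  vpaddd xmm0, xmm3, xmm3
  dec ecx
  jnz _clean_xmm_loop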

Armitage answered 31/3, 2021 at 21:58 Comment(1)
This is a nice explanation, thank you! The last note is also very enlightening; I guess that's a good reason to use vzeroupper even with AVX-only code.Triviality
