See https://github.com/dotnet/runtime/issues/92189 - For the same hardware reason that C compilers default to -mprefer-vector-width=256 when auto-vectorizing large loops, C# doesn't automatically make all vectorized code use 512-bit vectors even if they're available.
Also, for small problems, e.g. 9 floats, 512-bit vectors could mean no vectorized iterations happen at all, just the scalar fallback code.
Also, apparently some code-bases (hopefully accidentally) depend on Vector<T> not being wider than 32 bytes, so it would be a breaking change for those.
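To make the 9-float case concrete, here is a minimal sketch in plain C of how a simple vectorized loop divides its work between full-width vector iterations and scalar cleanup. The names `loop_split` and `split_loop` are hypothetical, made up for this illustration, and the sketch assumes the simplest loop shape (no masked or overlapping final vector).

```c
#include <assert.h>
#include <stddef.h>

/* How a simple vectorized loop divides n elements between full-width
   vector iterations and a scalar cleanup loop.
   (Hypothetical helper, for illustration only.) */
struct loop_split {
    size_t vector_iters;  /* iterations that process `lanes` elements each */
    size_t scalar_iters;  /* leftover elements handled one at a time */
};

struct loop_split split_loop(size_t n, size_t lanes)
{
    assert(lanes > 0);
    struct loop_split s = { n / lanes, n % lanes };
    return s;
}
```

For 9 floats, `split_loop(9, 8)` (256-bit vectors, 8 floats each) gives 1 vector iteration plus 1 scalar element, while `split_loop(9, 16)` (512-bit vectors) gives 0 vector iterations and 9 scalar elements.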
@stephentoub wrote: In .NET 8, the variable-width Vector<T> will not automatically support widths greater than 256 bits. It's likely in .NET 9 you'll be able to opt-in to that, but at present it's not clear whether it'll be enabled by default, [...]
I commented on the dotnet GitHub issue with some details about the CPU-hardware reasons; I'll reproduce some of that here:
- See SIMD instructions lowering CPU frequency
- Also https://reviews.llvm.org/D111029, which includes some Intel testing results that found clang auto-vectorization of SPEC2017 actually got a 1% slowdown with -mprefer-vector-width=512 vs. 256 on Ice Lake Xeon. But again, that's LLVM auto-vectorization of scalar code, not like C# where this would only affect manually-vectorized loops, so the tuning considerations are somewhat different from -mprefer-vector-width=256.
In a program that frequently wakes up for short bursts of computation, its AVX-512 usage will still lower turbo frequency for the core, affecting other programs.
Things are different on Zen 4; it handles 512-bit vectors by taking extra cycles in the execution units, so as long as 512-bit vectors don't require more shuffling work or some other effect that would add overhead, they are a good win for front-end throughput and for how far ahead out-of-order exec can see in terms of elements or scalar iterations. (A 512-bit uop is still only 1 uop for the front-end.) GCC and Clang default to -mprefer-vector-width=512 for -march=znver4.
There's no turbo penalty or other inherent downside to 512-bit vectors on Zen 4 (AFAIK; I don't know how misaligned loads perform). It's just a matter of whether software can use them efficiently (without needing more bloated code for loop prologues / epilogues, e.g. scalar cleanup if a masked final iteration doesn't Just Work). AVX-512 masked stores are efficient on Zen 4, despite the fact that AVX1/2 vmaskmovps / vpmaskmovd aren't. (https://uops.info/)
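As a rough sketch of what a "masked final iteration" involves, the portable C below computes the 16-bit lane mask for one iteration of a loop over 32-bit elements with 16 lanes per 512-bit vector. The function name `lane_mask16` is hypothetical. In real AVX-512 code a value like this would be used as a `__mmask16` for a masked load or store such as `_mm512_mask_storeu_ps`, which is left out here so the example compiles on any target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lane mask for the iteration starting at element i of a loop over n
   32-bit elements, 16 lanes per 512-bit vector. Bit k set means lane k
   is active. (Hypothetical helper, for illustration only.) */
uint16_t lane_mask16(size_t n, size_t i)
{
    assert(i < n);
    size_t remaining = n - i;
    if (remaining >= 16)
        return 0xFFFF;                          /* full vector: all lanes on */
    return (uint16_t)((1u << remaining) - 1u);  /* low `remaining` lanes on */
}
```

With masking, the 9-float problem becomes a single iteration with mask `0x01FF` instead of nine scalar iterations, which is the case where the tail "Just Works" without separate cleanup code.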
For code where you have exactly 32 bytes of something, if 32-byte vectors are no longer an option then that's a loss. C#'s scalable vector-length model isn't ideal for those cases. ARM SVE and the RISC-V Vector extension are hardware ISAs designed around a variable vector length, with masking to handle vectors shorter than the HW's native length, but doing the same thing for C# Vector<T> probably wouldn't work well because lots of hardware (x86 with AVX2, or AArch64 without SVE) can't efficiently support masking for arbitrary-length stuff.
I wrote more about Intel on the GitHub issue, which I'm not going to copy/paste all of here.
There can be significant overall throughput gains from 512-bit vectors for some workloads on Intel CPUs. But it comes with downsides, like more expensive misaligned memory access.
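One contributing factor to the misaligned-access cost can be shown with simple address arithmetic, assuming 64-byte cache lines (the usual size on current x86 CPUs): a misaligned 64-byte access always spans two cache lines, while a stream of misaligned 32-byte accesses only crosses a line boundary every other time. The sketch below just counts those splits; `count_line_splits` and `CACHE_LINE` are hypothetical names, and it says nothing about the actual cycle cost on any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

/* Count how many of `count` back-to-back accesses of `width` bytes,
   starting at byte address `start`, touch two cache lines.
   (Hypothetical helper, for illustration only.) */
size_t count_line_splits(uintptr_t start, size_t width, size_t count)
{
    assert(width > 0 && width <= CACHE_LINE);
    size_t splits = 0;
    for (size_t k = 0; k < count; k++) {
        uintptr_t first = start + k * width;  /* first byte accessed */
        uintptr_t last = first + width - 1;   /* last byte accessed */
        if (first / CACHE_LINE != last / CACHE_LINE)
            splits++;
    }
    return splits;
}
```

Starting 16 bytes past a line boundary, `count_line_splits(16, 64, 8)` reports that all 8 of the 64-byte accesses split, whereas `count_line_splits(16, 32, 8)` reports 4 of the 8 32-byte accesses.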