.NET 8 supports Vector512, but why doesn't Vector<T> reach 512 bits?

My CPU is an AMD Ryzen 7 7840H, which supports the AVX-512 instruction set. When I run a .NET 8 program, Vector512.IsHardwareAccelerated is true, but System.Numerics.Vector<T> is still 256 bits wide rather than 512. Why doesn't the Vector<T> type reach 512 bits in length? Is this currently unsupported, or do I need to tweak some configuration?

Example code:

TextWriter writer = Console.Out;
writer.WriteLine(string.Format("Vector512.IsHardwareAccelerated:\t{0}", Vector512.IsHardwareAccelerated));
writer.WriteLine(string.Format("Vector.IsHardwareAccelerated:\t{0}", Vector.IsHardwareAccelerated));
writer.WriteLine(string.Format("Vector<byte>.Count:\t{0}\t# {1}bit", Vector<byte>.Count, Vector<byte>.Count * 8));

Test results:

Vector512.IsHardwareAccelerated:        True
Vector.IsHardwareAccelerated:   True
Vector<byte>.Count:     32      # 256bit
Complacence answered 19/11, 2023 at 4:40 Comment(1)
This is a duplicate of an unfortunately deleted question from a couple months ago: #77118899 - But fortunately the key info is in a GitHub issue the OP of that question opened, with replies from C# / .NET devs. – Geodesy

See https://github.com/dotnet/runtime/issues/92189 - For the same hardware reason that C compilers default to -mprefer-vector-width=256 when auto-vectorizing large loops, C# doesn't automatically make all vectorized code use 512-bit vectors even when they're available.

Also, for small problems (e.g. 9 floats), a wider Vector<T> could mean no vectorized iterations happen at all, just the scalar fallback code.
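
As a rough illustration (my sketch, not from the linked issue; Sum is a hypothetical helper): summing a 9-element span with 256-bit Vector<T> (Count == 8) gets one vector iteration, but with a 512-bit Vector<T> (Count == 16) the vector loop would never execute and everything would run through the scalar tail:

using System;
using System.Numerics;

static float Sum(ReadOnlySpan<float> values)
{
    float total = 0f;
    int i = 0;
    // With Vector<float>.Count == 8 (256-bit), a 9-element input gets one vector iteration.
    // If Count were 16 (512-bit), this loop body would never run for 9 elements.
    for (; i <= values.Length - Vector<float>.Count; i += Vector<float>.Count)
        total += Vector.Sum(new Vector<float>(values.Slice(i, Vector<float>.Count)));
    // Scalar fallback handles whatever the vector loop didn't cover.
    for (; i < values.Length; i++)
        total += values[i];
    return total;
}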

Also, apparently some code-bases (hopefully accidentally) depend on Vector<T> not being wider than 32 bytes, so widening it would be a breaking change for them.

@stephentoub wrote: "In .NET 8, the variable-width Vector<T> will not automatically support widths greater than 256 bits. It's likely in .NET 9 you'll be able to opt in to that, but at present it's not clear whether it'll be enabled by default."
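
In the meantime, nothing stops you from opting in manually: the fixed-width System.Runtime.Intrinsics.Vector512<T> API is usable in .NET 8 wherever it's accelerated. A minimal sketch (mine, not from the issue; it assumes the Vector512.Create(ReadOnlySpan<T>) and Vector512.Sum helpers, and falls back to the variable-width path otherwise):

using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static float Sum512(ReadOnlySpan<float> values)
{
    float total = 0f;
    int i = 0;
    if (Vector512.IsHardwareAccelerated)
    {
        // Explicit 512-bit path: 16 floats per iteration on AVX-512 hardware.
        for (; i <= values.Length - Vector512<float>.Count; i += Vector512<float>.Count)
            total += Vector512.Sum(Vector512.Create<float>(values.Slice(i, Vector512<float>.Count)));
    }
    // Variable-width path (256-bit in .NET 8 on this hardware) for what's left, then scalar cleanup.
    for (; i <= values.Length - Vector<float>.Count; i += Vector<float>.Count)
        total += Vector.Sum(new Vector<float>(values.Slice(i, Vector<float>.Count)));
    for (; i < values.Length; i++)
        total += values[i];
    return total;
}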


I commented on the dotnet GitHub issue with some details about the CPU-hardware reasons; I'll reproduce some of that here:

  • See SIMD instructions lowering CPU frequency
  • Also https://reviews.llvm.org/D111029, including some Intel testing results that found clang auto-vectorization of SPEC2017 actually got a 1% slowdown with -mprefer-vector-width=512 vs. 256 on Ice Lake Xeon. But again, that's LLVM auto-vectorization of scalar code; in C# this would only affect manually-vectorized loops, so the tuning considerations are somewhat different from those behind -mprefer-vector-width=256.

In a program that frequently wakes up for short bursts of computation, its AVX-512 usage will still lower turbo frequency for the core, affecting other programs.

Things are different on Zen 4; it handles 512-bit vectors by taking extra cycles in the execution units, so as long as 512-bit vectors don't require more shuffling work or some other effect that adds overhead, they're a good win for front-end throughput and for how far ahead out-of-order exec can see in terms of elements or scalar iterations (since a 512-bit uop is still only 1 uop for the front-end). GCC and Clang default to -mprefer-vector-width=512 for -march=znver4.

There's no turbo penalty or other inherent downside to 512-bit vectors on Zen 4 (AFAIK; I don't know how misaligned loads perform). It's just a matter of whether software can use them efficiently, without needing more bloated code for loop prologues / epilogues, e.g. scalar cleanup if a masked final iteration doesn't Just Work. AVX-512 masked stores are efficient on Zen 4, despite the fact that the AVX1/2 vmaskmovps / vpmaskmovd ones aren't (https://uops.info/).
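
For completeness, here's a sketch (mine, not part of this answer) of one common way to handle the tail in C# without either masking or a scalar cleanup loop: re-do one overlapping full-width vector at the end of the buffer. ScaleBy2 is a hypothetical helper, it assumes the Vector512.Create(ReadOnlySpan<T>) / CopyTo helpers, and it's only valid when source and destination don't overlap:

using System;
using System.Runtime.Intrinsics;

static void ScaleBy2(ReadOnlySpan<float> source, Span<float> destination)
{
    int width = Vector512<float>.Count;   // 16 floats per 512-bit vector
    if (!Vector512.IsHardwareAccelerated || source.Length < width)
    {
        for (int j = 0; j < source.Length; j++)
            destination[j] = source[j] * 2f;   // scalar path for tiny or unaccelerated cases
        return;
    }
    int i = 0;
    for (; i + width <= source.Length; i += width)
        (Vector512.Create<float>(source.Slice(i, width)) * 2f).CopyTo(destination.Slice(i, width));
    if (i != source.Length)
    {
        // Overlapping final vector: recompute the last 'width' elements from the source so the
        // store ends exactly at the end of the buffer. Harmless because it recomputes from the
        // unmodified source (which is why source and destination must not alias).
        int last = source.Length - width;
        (Vector512.Create<float>(source.Slice(last, width)) * 2f).CopyTo(destination.Slice(last, width));
    }
}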

For code where you have exactly 32 bytes of something, if 32-byte vectors are no longer an option then that's a loss; C#'s scalable vector-length model isn't ideal for those cases. ARM SVE and the RISC-V Vector extension are hardware ISAs designed around a variable vector length, with masking to handle data shorter than the hardware's native length, but doing the same thing for C#'s Vector<T> probably wouldn't work well because a lot of hardware (x86 with AVX2, or AArch64 without SVE) can't efficiently support masking for arbitrary lengths.


I wrote more about Intel on the GitHub issue, which I'm not going to copy/paste all of here.

There can be significant overall throughput gains from 512-bit vectors for some workloads on Intel CPUs, but they come with downsides there, like more expensive misaligned memory access.

Geodesy answered 19/11, 2023 at 5:23 Comment(0)
