How do I enable SSE4.1 and SSE3 (but NOT AVX) in MSVC

I am trying to enable different levels of SIMD support in MSVC.

There is a documentation page about enabling some SIMD instruction sets, such as SSE2, AVX and AVX2: https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?redirectedfrom=MSDN&view=vs-2019

However, it does not mention how to enable other SIMD instruction sets, e.g. SSE3, SSE4.1 or SSE4.2. Is it possible to enable these without enabling AVX?

Also, it looks like /arch:SSE2 is no longer supported/needed in MSVC 2017. Can I assume that SSE3/SSE4.1/SSE4.2 are enabled by default as well?

Landslide answered 24/9, 2020 at 19:59 Comment(2)

can I assume that SSE3/SSE4.1/SSE4.2 are enabled by default as well? - No, SSE2 is baseline for x86-64. Every x86-64 CPU is guaranteed to have SSE2. I assume that's why you don't need an option for it. But there are some AMD x86-64 CPUs without SSE3, and some Intel x86-64 CPUs without SSE4.1 (e.g. first-gen Core 2). – Conscientious 24/9, 2020 at 20:11

I don't know the answer to your question, though. You might only get SSE4 without AVX via intrinsics because MSVC is bad at this (or designed around a runtime-dispatch model, not compile-time), but maybe there's an MSVC option. You could use a compiler like clang where you can use -O3 -msse4.1 or -O3 -march=penryn – Conscientious 24/9, 2020 at 20:12

H

2

Apparently you can pass /arch: options in an undocumented way as /d2... options, for example /d2archAVX.

/d2archSSE42 is accepted this way (but not SSE41 or SSE3).

@Peter pointed out in a comment a case where /d2archSSE42 makes a difference: https://godbolt.org/z/EsjW4vTne

Halfprice answered 25/9, 2021 at 17:18 Comment(2)

A good candidate for auto-vectorization with SSE4.1 would be something that can use pminsd, packed integer min. Enabling /d2archSSE42 gets MSVC to omit checking cmp __isa_available, 2 and a fallback scalar loop (MSVC is dumb and doesn't know how to use SSE2 pcmpgtd for the fallback). godbolt.org/z/EsjW4vTne. I had to write out a conditional expression, though; MSVC failed to vectorize with std::min. GCC was fine of course. Also, pmulld is an even simpler thing to vectorize with, arr[i] *= 1234;, and yes MSVC does that, too. – Conscientious 25/9, 2021 at 18:41

@PeterCordes, the current MSVC can use popcount, but it is used unconditionally only starting with /arch:AVX, otherwise there's a branch with detection. That's because the use of popcount is coded in the library implementation, and there's no way for a library to query for SSE4.2 or BMI in MSVC -- see godbolt.org/z/j6cd58931 – Halfprice 15/5, 2023 at 9:30
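For readers who don't want to open the Godbolt links, here is a minimal sketch of the kind of loops the comments describe. The function names are made up for illustration, and this is not a copy of the linked code. According to the comment above, MSVC needed the minimum written out as a conditional expression (rather than std::min) to vectorize it with pminsd, and the constant multiply is a candidate for pmulld.

```cpp
#include <cstddef>

// Element-wise minimum written as a conditional expression.
// (Hypothetical name; candidate for SSE4.1 pminsd when auto-vectorized.)
void min_arrays(const int* a, const int* b, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (a[i] < b[i]) ? a[i] : b[i];
    }
}

// Multiply every element by a constant.
// (Hypothetical name; candidate for SSE4.1 pmulld when auto-vectorized.)
void scale_array(int* arr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        arr[i] *= 1234;
    }
}
```

Whether the compiler emits a runtime __isa_available check plus a scalar fallback, or unconditional SSE4.1 code, depends on the compiler version and on options such as the undocumented /d2archSSE42, so it is worth inspecting the generated assembly yourself.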

M

5

The VC++ compiler is less smart than you think it is. Here’s how these settings work.

When you’re building 32-bit code and enable SSE1 or SSE2, it enables automatic vectorization into respective instruction sets.

When you’re building 64-bit code, both SSE1 and SSE2 are part of the base instruction set; every AMD64 processor in the world is required to support both of them. That’s why you’re getting the warning with /arch:SSE2.

When you enable AVX, the compiler does two things: it enables automatic vectorization into AVX1, and it switches the instruction encoding (for all instructions: SSE and AVX, manually vectorized and auto-vectorized) from legacy to VEX. VEX is good stuff: it allows unaligned memory reads to be fused into other instructions as memory operands. It also solves dependency issues which may affect performance: VEX-encoded vaddps xmm0, xmm0, xmm1 zeroes out the upper 16 bytes of ymm0, while legacy-encoded addps xmm0, xmm1 keeps the data there.

When you enable AVX2, it adds a few minor optimizations; most notably, things like _mm_set1_epi32 may compile into vpbroadcastd. It also switches the encoding to VEX, as for AVX1.
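To see which of these settings a translation unit was actually compiled with, you can look at the predefined macros. The sketch below is an illustration only, and the function name is made up. It relies on MSVC's documented __AVX__, __AVX2__, _M_X64 and _M_IX86_FP macros; __x86_64__ is the GCC/Clang counterpart, added so the snippet also builds there. As far as I know, MSVC does not define macros such as __SSE3__ or __SSE4_1__, so there is nothing comparable to test for those levels.

```cpp
#include <string>

// Reports the SIMD level the compiler was told to target at build time.
// (Hypothetical helper; the macro names are the compilers' own.)
inline std::string compile_time_simd_level() {
#if defined(__AVX2__)
    return "AVX2";                       // /arch:AVX2
#elif defined(__AVX__)
    return "AVX";                        // /arch:AVX
#elif defined(_M_X64) || defined(__x86_64__)
    return "SSE2 (x86-64 baseline)";     // always available on x64
#elif defined(_M_IX86_FP) && _M_IX86_FP == 2
    return "SSE2";                       // 32-bit, /arch:SSE2
#elif defined(_M_IX86_FP) && _M_IX86_FP == 1
    return "SSE";                        // 32-bit, /arch:SSE
#else
    return "none/unknown";
#endif
}
```

This only tells you what the compiler assumes; it says nothing about the CPU the binary ends up running on.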

Note the word "automatic" above. The Microsoft compiler doesn’t do runtime dispatch or cpuid checks, and the automatic vectorizer doesn’t use SSE3 or 4.1. If you’re writing manually vectorized code, the compiler won’t add fallbacks; it will emit whatever instructions you asked for. When present, the AVX/AVX2 setting only affects their encoding.
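As an illustration of manually vectorized code, here is a small sketch that uses the SSE4.1 intrinsic _mm_min_epi32. The function name and the TARGET_SSE41 macro are made up for this example. MSVC compiles such intrinsics without any /arch switch; the target attribute is only there so that the same file also builds with GCC or Clang without -msse4.1. It is x86-only.

```cpp
#include <smmintrin.h>   // SSE4.1 intrinsics
#include <cstddef>
#include <cstdint>

// GCC/Clang need a per-function target attribute (or -msse4.1);
// MSVC needs nothing. (TARGET_SSE41 is a hypothetical helper macro.)
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// Element-wise minimum of two int32 arrays, 4 lanes at a time.
// There is no fallback here: on a CPU without SSE4.1 the pminsd
// instruction raises an illegal-instruction fault.
TARGET_SSE41
void min_i32_sse41(const std::int32_t* a, const std::int32_t* b,
                   std::int32_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_min_epi32(va, vb));
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        out[i] = (a[i] < b[i]) ? a[i] : b[i];
    }
}
```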

If you want to write manually vectorized code that uses SSE3, SSSE3, SSE 4.1, FMA3, AES, SHA, etc., you don’t need to enable anything. You just need to include the relevant headers, and ideally verify at runtime that the CPU supports them. For the last part, I usually call __cpuid early on startup and check these bits. This lets me show a comprehensible error message about an unsupported CPU, instead of a hard crash later.
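A minimal sketch of such a startup check might look like the following. The struct and function names are made up for illustration. __cpuid is the MSVC intrinsic from <intrin.h>; the __get_cpuid branch from <cpuid.h> is an addition so the same code builds with GCC or Clang. The bit positions are those of CPUID leaf 1. AVX is deliberately left out, because detecting it properly also requires checking OS support for the YMM state.

```cpp
#include <cstdio>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      // __cpuid
#define HAVE_X86_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>       // __get_cpuid
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0 // not x86: every flag stays false
#endif

// Hypothetical container for the feature bits of interest.
struct CpuFeatures {
    bool sse2 = false, sse3 = false, ssse3 = false;
    bool sse41 = false, sse42 = false, popcnt = false;
};

inline CpuFeatures query_cpu_features() {
    CpuFeatures f;
#if HAVE_X86_CPUID
    unsigned ecx = 0, edx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                    // leaf 1: feature flags
    ecx = static_cast<unsigned>(regs[2]);
    edx = static_cast<unsigned>(regs[3]);
#else
    unsigned eax = 0, ebx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return f;
#endif
    f.sse2   = ((edx >> 26) & 1u) != 0;
    f.sse3   = ((ecx >> 0)  & 1u) != 0;
    f.ssse3  = ((ecx >> 9)  & 1u) != 0;
    f.sse41  = ((ecx >> 19) & 1u) != 0;
    f.sse42  = ((ecx >> 20) & 1u) != 0;
    f.popcnt = ((ecx >> 23) & 1u) != 0;
#endif
    return f;
}

// Prints a readable error instead of crashing later on an
// unsupported instruction. Requiring SSE4.1 is just an example.
inline bool cpu_is_supported(const CpuFeatures& f) {
    if (f.sse41) return true;
    std::fprintf(stderr, "This program requires a CPU with SSE4.1.\n");
    return false;
}
```

You would typically call query_cpu_features() once at the start of main and exit with a non-zero code if cpu_is_supported returns false.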

Maracaibo answered 25/9, 2020 at 4:40 Comment(0)
