MSVC and ICC only use instructions that do alignment checking when they fold a load into a memory source operand without AVX enabled, like addps xmm0, [rax]. SSE memory source operands require alignment, unlike AVX. But you can't reliably control when this folding happens, and in debug builds it generally doesn't.
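For example, with a function like this (a minimal sketch; the asm in the comments is an assumption about typical codegen, not verified MSVC output):

#include <xmmintrin.h>

__m128 add4(const float* p, __m128 v) {
    // with optimization on and AVX off, MSVC/ICC may fold the load into
    //   addps xmm0, [rax]      ; faults if p is misaligned
    // otherwise they emit a separate movups load, which never faults
    return _mm_add_ps(v, _mm_load_ps(p));
}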
As Mysticial points out in Visual Studio 2017: _mm_load_ps often compiled to movups, another case where you do get alignment checking is NT load/store, because there is no unaligned version of those instructions.
If your code is compatible with clang-cl, have Visual Studio use it instead of MSVC. It's a modified version of clang that tries to act more like MSVC. But like GCC, clang uses aligned load and store instructions for aligned intrinsics.
Either disable optimization, or make sure AVX is not enabled; otherwise the compiler could fold a _mm_load_ps into a memory source operand like vaddps xmm0, xmm0, [rax], which doesn't require alignment because it's the AVX version. This may be a problem if your code also uses AVX intrinsics in the same file, because clang requires that you enable ISA extensions for the intrinsics you want to use: unlike MSVC and ICC, it won't emit asm instructions for an extension that isn't enabled, even via intrinsics.
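For example (a sketch of the difference, not compiler-verified): clang-cl refuses to compile a function like this unless AVX is enabled (e.g. /arch:AVX), while MSVC and ICC accept it at any /arch setting.

#include <immintrin.h>

__m256 add8(const float* p, __m256 v) {
    // clang/clang-cl: error unless AVX is enabled; MSVC/ICC: compiles regardless
    return _mm256_add_ps(v, _mm256_load_ps(p));
}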
A debug build should work even with AVX enabled, especially if you _mm_load_ps or _mm256_load_ps into a separate variable in a separate statement, rather than writing v = _mm_add_ps(v, _mm_load_ps(ptr));
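A sketch of that debug-friendly pattern (function and variable names are just for illustration):

#include <immintrin.h>

void add_inplace(__m128* v, const float* ptr) {
    __m128 tmp = _mm_load_ps(ptr);   // own statement: nothing to fold, so an
                                     // alignment-checking (v)movaps load is likely
    *v = _mm_add_ps(*v, tmp);        // rather than *v = _mm_add_ps(*v, _mm_load_ps(ptr));
}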
With MSVC itself, for debugging purposes only (usually very big speed penalty for stores), you could substitute normal loads/stores with NT. Since they're special, the compiler won't fold loads into memory source operands for ALU instructions, so this can maybe work even with AVX with optimization enabled.
// alignment_debug.h (untested)
// #include this *after* immintrin.h
#ifdef DEBUG_SIMD_ALIGNMENT
#warn "using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations"
// SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
// On WB, it's just a slower MOVDQA, wasting an ALU uop.
// cast through __m128i*: _mm_stream_load_si128 historically takes a non-const __m128i*
#define _mm_load_si128(ptr) _mm_stream_load_si128((__m128i*)(ptr))
#define _mm_load_ps(ptr)    _mm_castsi128_ps(_mm_stream_load_si128((__m128i*)(ptr)))
#define _mm_load_pd(ptr)    _mm_castsi128_pd(_mm_stream_load_si128((__m128i*)(ptr)))
// SSE1/2 MOVNTPS / PD / MOVNTDQ evict data from cache if it was hot, and bypass cache
#define _mm_store_ps _mm_stream_ps // SSE1 movntps
#define _mm_store_pd _mm_stream_pd // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
#define _mm_store_si128 _mm_stream_si128 // SSE2 movntdq
// and repeat for _mm256_... versions with _mm256_castsi256_ps
// and _mm512_... versions
// edit welcome if anyone tests this and adds those versions
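// untested sketch of the _mm256_... versions (same idea as above; note that
// _mm256_stream_load_si256 / vmovntdqa ymm requires AVX2, not just AVX)
#define _mm256_load_si256(ptr) _mm256_stream_load_si256((__m256i*)(ptr))
#define _mm256_load_ps(ptr)    _mm256_castsi256_ps(_mm256_stream_load_si256((__m256i*)(ptr)))
#define _mm256_load_pd(ptr)    _mm256_castsi256_pd(_mm256_stream_load_si256((__m256i*)(ptr)))
#define _mm256_store_ps    _mm256_stream_ps     // AVX vmovntps ymm
#define _mm256_store_pd    _mm256_stream_pd     // AVX vmovntpd ymm
#define _mm256_store_si256 _mm256_stream_si256  // AVX vmovntdq ymm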
#endif
Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang
Intel since Nehalem has no penalty for movups / movdqu loads / stores when the address doesn't cross a cache-line boundary (which includes the aligned case). AMD since Bulldozer has no penalty when it does cross a 32 or maybe 16-byte boundary. K10 has a penalty for movups stores but not loads even on aligned addresses; Core 2 has a penalty for both. I don't use MSVC so IDK if it has any tune option that takes those old CPUs into consideration. (gcc / clang use movaps whenever there's a compile-time alignment guarantee, so using one of those would be another option.) – Lorant