Visual Studio 2017: _mm_load_ps often compiled to movups

I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.

The data I'm using _mm_load_ps on is defined like this:

struct alignas(16) Vector {
    float v[4];
};

// often embedded in other structs like this
struct AABB {
    Vector min;
    Vector max;
    bool intersection(/* parameters */) const;
};

Now when I'm using this construct, the following will happen:

// this code
__m128 bb_min = _mm_load_ps(min.v);

// generates this
movups  xmm4, XMMWORD PTR [r8]

I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?

EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler turns _mm_load_ps (the intrinsic for aligned memory) into movups. If I know the struct was allocated at an aligned address and I'm accessing it through the this pointer, it would be safe to use movaps, right?
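
For what it's worth, a quick way to sanity-check the alignment assumption at runtime (a throwaway helper using the Vector struct above, not part of the real code):

#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// If this assert never fires, the address really is 16-byte aligned and a
// movaps would have been safe here, even though the compiler emits movups.
static __m128 load_checked(const Vector& vec) {
    assert(reinterpret_cast<std::uintptr_t>(vec.v) % 16 == 0);
    return _mm_load_ps(vec.v);
}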

Carolus asked 9/3, 2017 at 13:49 Comment(17)
For what purpose do you specifically want a movaps?Swithbart
@Swithbart He's moving four floats, and aligned instructions are often more performant, particularly on some CPU generations.Actinomycosis
Possible duplicate of SSE, intrinsics, and alignmentActinomycosis
@Actinomycosis Yes, Core 2. It doesn't matter on anything newer as far as I know, as long as the address is actually aligned.Swithbart
tl;dr: alignas isn't a guarantee in every situation - memcpy can put copies of these structs anywhere (including unaligned locations), malloc won't always give you suitably aligned memory, etc. See the dupe - you generally need to write your own allocator using _aligned_malloc (a sketch follows these comments).Actinomycosis
also, read through the Remarks section here. (That refers to __declspec(align(#)), but since VS2015 alignas support is implemented as a veneer for the same.)Actinomycosis
The discussion here is also interesting.Carolus
It is by definition safe to use movaps to implement _mm_load_ps (regardless of actual alignment); it just apparently didn't happen here.Swithbart
@harold: OK, but is that something I can influence? (Apart from writing assembler code)Carolus
You need to show a complete example that demonstrates the problem, including the compiler options you've used and the version of Visual Studio 2017 you're using.Selfregulating
@Swithbart No, movaps will certainly cause an exception with an unaligned address.Actinomycosis
@Actinomycosis Yes, and _mm_load_ps is allowed to do that too, though it doesn't have to.Swithbart
On VS and ICC, if you compile for AVX or higher, the compiler almost never issues aligned SIMD load/stores. It's allowed to do that since it's not a loss of functionality and all processors starting from Nehalem have no penalty for using unaligned load/stores when the address is aligned. They do it because it makes the compiler simpler (not have to choose between aligned/unaligned) and it doesn't crash if it's misaligned. Though I strongly disagree with that latter one since I'd much prefer that it actually crash on misalignment since that's a bug that should be fixed, not hidden.Magnification
@Mystical: That's good information, but I just compile for x64. Does the same apply there?Carolus
@Magnification Your answer sounds pretty convincing to me. Maybe post it as an actual answer if you have the timeHart
@GuillaumeGris Done.Magnification
Related: Is there a way to force visual studio to generate aligned sse intrinsics - maybe not.Graffito
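
A minimal sketch of the aligned-allocation route mentioned in the comments above, assuming the AABB struct from the question (make_aabb/destroy_aabb are illustrative names, not a real API):

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)
#include <new>        // placement new

AABB* make_aabb() {
    // Request storage whose alignment matches the struct (16 bytes here),
    // independent of what a plain malloc would return.
    void* mem = _aligned_malloc(sizeof(AABB), alignof(AABB));
    return mem ? new (mem) AABB() : nullptr;   // construct in place
}

void destroy_aabb(AABB* p) {
    if (p) {
        p->~AABB();          // destroy before releasing the raw storage
        _aligned_free(p);    // must pair with _aligned_malloc
    }
}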

On recent versions of Visual Studio and the Intel Compiler (recent as in post-2013?), the compiler rarely, if ever, generates aligned SIMD load/stores anymore.

When compiling for AVX or higher:

  • The Microsoft compiler (>VS2013?) doesn't generate aligned loads. But it still generates aligned stores.
  • The Intel compiler (> Parallel Studio 2012?) doesn't generate them at all anymore. But you'll still see them in ICC-compiled binaries inside its hand-optimized libraries such as memset().
  • As of GCC 6.1, it still generates aligned load/stores when you use the aligned intrinsics.
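
A tiny probe one can feed to these compilers (e.g. in a compiler explorer) to observe this; the function name is arbitrary:

#include <xmmintrin.h>

// Per the list above: MSVC 2017 and recent ICC typically emit unaligned
// moves (movups/vmovups) for both operations, while GCC 6.1 keeps the
// aligned forms (movaps/vmovaps) for the aligned intrinsics.
void copy_vec(float* dst, const float* src) {
    __m128 v = _mm_load_ps(src);   // aligned-load intrinsic
    _mm_store_ps(dst, v);          // aligned-store intrinsic
}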

The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.

Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.

Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.

New Revelations:

  • Starting with Parallel Studio 2018, the Intel Compiler no longer generates aligned moves at all - even for pre-Nehalem targets.
  • Starting from Visual Studio 2017, the Microsoft Compiler also no longer generates aligned moves at all - even when targeting pre-AVX hardware.

Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.


The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.

So if you just want to test your code for correctness, you can substitute the aligned load/store intrinsics with non-temporal ones. But be careful not to let something like this slip into production code, since NT load/stores (NT-stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.
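
A minimal sketch of that substitution (my own wrapper and macro names, assuming SSE4.1 for movntdqa); because the NT instructions have no unaligned form, a misaligned pointer faults immediately instead of being silently loaded with movups:

#include <smmintrin.h>   // _mm_stream_load_si128 (SSE4.1), pulls in xmmintrin.h

#ifdef CHECK_ALIGNMENT_BUILD   // arbitrary name for the test configuration
// Test build: movntdqa / movntps fault (#GP) on misaligned addresses.
static inline __m128 load_ps_checked(const float* p) {
    return _mm_castsi128_ps(_mm_stream_load_si128(
        reinterpret_cast<__m128i*>(const_cast<float*>(p))));
}
static inline void store_ps_checked(float* p, __m128 v) {
    _mm_stream_ps(p, v);   // NT store: bypasses the cache - keep it out of production
}
#else
// Normal build: the plain aligned load/store intrinsics.
static inline __m128 load_ps_checked(const float* p)    { return _mm_load_ps(p); }
static inline void store_ps_checked(float* p, __m128 v) { _mm_store_ps(p, v); }
#endif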

Magnification answered 2/8, 2017 at 16:45 Comment(7)
Related: gcc also really likes alignment when auto-vectorizing, and goes scalar until an alignment boundary (with fully-unrolled intro/cleanup code, which is a lot of code-bloat with AVX2 and small elements). It does this even with -mtune=skylake or something. Anyway, making sure gcc knows about any alignment guarantees you can give it will reduce code-bloat and avoid a conditional branch or two when auto-vectorizing (a small sketch follows these comments).Graffito
NT load on write-back memory runs exactly identical to a normal load, on Intel Sandybridge-family at least. They could have made it work somewhat like prefetchNTA, but didn't (probably because it would need hardware prefetchers that were NT-aware for it to not suck). (Working on an update to #32104468; turns out my guess was wrong that it did something like fetching into only one way of cache to avoid pollution. Only pfNTA does that.)Graffito
@PeterCordes Interestingly, the NT load throughput is only 1/cycle on Skylake X as opposed to 2/cycle for all other loads. (according to AIDA64)Magnification
On Skylake-S (desktop), reloading the same 64 bytes with movntdqa xmm0, [rsi] / movntdqa xmm1, [rsi+16], etc. it runs ~1.71 per clock, vs. 2.0 per clock for movdqa. So even for the most trivial case, it's slower. Thanks for pointing that out.Graffito
Those AIDA64 numbers show that AVX512 EVEX vmovntdqa (1 per 1.08) is different from regular SSE or AVX VEX movntdqa (1 per 0.52). And that EVEX VMOVNTDQA + VMOVNTDQ x/y/zmm reload/store still has terrible latency, but throughput is 1 per ~19.25c instead of being the same as latency. (And ZMM NT store/reload latency is lower than the other two sizes, which is another hint that full-cache-line NT stores are special. Being much higher single-threaded bandwidth than narrower NT stores was already a big hint.)Graffito
Yeah. I haven't tried to figure out what changed underneath. But when I toggled NT stores in my code, the difference was drastic - something like 10 - 15%. This is more than what I saw on Haswell. Granted, much of that might've had to do with the overall memory bandwidth bottleneck.Magnification
Regarding "hiding misalignment penalties": This also happens with clang/gcc and AVX if a _mm_load[u]_ps can be fused with another operation (like vaddps): godbolt.org/z/2ZL5FQ So it is also not always trivial to force clang/gcc to actually generate [v]movaps instructions.Interlocution
