While benchmarking code involving std::optional<double>, I noticed that the code MSVC generates runs at roughly half the speed of that produced by clang or gcc. After spending some time reducing the code, I found that MSVC apparently has issues generating code for std::optional::operator=. Using std::optional::emplace() does not exhibit the slowdown.
The following function
void test_assign(std::optional<double>& f) {
    f = std::optional{42.0};
}
produces
sub rsp, 24
vmovsd xmm0, QWORD PTR __real@4045000000000000
mov BYTE PTR $T1[rsp+8], 1
vmovups xmm1, XMMWORD PTR $T1[rsp]
vmovsd xmm1, xmm1, xmm0
vmovups XMMWORD PTR [rcx], xmm1
add rsp, 24
ret 0
Notice the unaligned mov instructions and the round trip through the stack. In contrast, the function
void test_emplace(std::optional<double>& f) {
    f.emplace(42.0);
}
compiles to
mov rax, 4631107791820423168 ; 4045000000000000H
mov BYTE PTR [rcx+8], 1
mov QWORD PTR [rcx], rax
ret 0
This version is much simpler and faster.
These were generated using MSVC 19.32 with /O2 /std:c++17 /DNDEBUG /arch:AVX.
clang 14 with -O3 -std=c++17 -DNDEBUG -mavx produces
movabs rax, 4631107791820423168
mov qword ptr [rdi], rax
mov byte ptr [rdi + 8], 1
ret
in both cases.
Replacing std::optional<double> with
struct MyOptional {
    double d;
    bool hasValue; // Required to reproduce the problem
    MyOptional(double v) {
        d = v;
    }
    void emplace(double v) {
        d = v;
    }
};
exhibits the same issue. Apparently MSVC has some trouble with the additional bool member.
See godbolt for a live example.
Why is MSVC producing this code? To be clear, the question is not why the moves are unaligned rather than aligned (which wouldn't improve things, according to this post), but why MSVC emits a considerably more expensive instruction sequence in the assignment case. Is this simply a bug (or missed optimization) in MSVC, or am I missing something?
If MSVC had put the whole std::optional<double> in static memory and copied it with vmovups xmm1, [static_constant] / vmovups [rcx], xmm1, it might not have been so bad. But instead it puts only the double constant in static memory, constructs the std::optional<double> on the stack, and then copies it to its destination. – Vipul

MSVC doesn't even store the double to the stack; it stores a 1 with a byte store, then does a 16-byte load, then replaces the low 8 bytes of the XMM register with a movsd register blend. So storing to the stack and reloading was just a way to get a 1 into byte 8 of an XMM register, with the rest being don't-care garbage. Taking 2 instructions and creating a store-forwarding stall for that is horrible versus vpcmpeqd xmm1, xmm1, xmm1 / vpabsd xmm1, xmm1. And then, if you want to merge a new low half from memory, use movlps, not a movsd load plus a movsd xmm, xmm blend. – Danas

The problem isn't that the vmovups instructions aren't vmovaps; the address likely is 16-byte aligned, since the compiler knows the incoming stack alignment. The problem is the store-forwarding stall from a narrow store followed by a wide reload! (GCC isn't immune from shooting itself in the foot either, though: Bubble sort slower with -O3 than -O2 with GCC.) – Danas