MSVC generating unnecessarily complicated instructions

While benchmarking code involving std::optional<double>, I noticed that the code MSVC generates runs at roughly half the speed of the code produced by clang or gcc. After spending some time reducing the code, I found that MSVC apparently has trouble generating code for std::optional::operator=. Using std::optional::emplace() does not exhibit the slowdown.

The following function

void test_assign(std::optional<double> & f){
    f = std::optional{42.0};
}

produces

sub     rsp, 24
vmovsd  xmm0, QWORD PTR __real@4045000000000000
mov     BYTE PTR $T1[rsp+8], 1
vmovups xmm1, XMMWORD PTR $T1[rsp]
vmovsd  xmm1, xmm1, xmm0
vmovups XMMWORD PTR [rcx], xmm1
add     rsp, 24
ret     0

Notice the unaligned move operations (vmovups). In contrast, the function

void test_emplace(std::optional<double> & f){
    f.emplace(42.0);
}

compiles to

mov     rax, 4631107791820423168      ; 4045000000000000H
mov     BYTE PTR [rcx+8], 1
mov     QWORD PTR [rcx], rax
ret     0

This version is much simpler and faster. Both listings were generated using MSVC 19.32 with /O2 /std:c++17 /DNDEBUG /arch:AVX.

clang 14 with -O3 -std=c++17 -DNDEBUG -mavx produces

movabs  rax, 4631107791820423168
mov     qword ptr [rdi], rax
mov     byte ptr [rdi + 8], 1
ret

in both cases.
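For anyone who wants to reproduce the comparison locally, the two functions can be placed in one translation unit roughly as follows. This is only a sketch: the helper both_forms_agree is a hypothetical addition, not part of the original test case. It only checks that both forms leave the optional in the same state, and says nothing about speed.

```cpp
#include <cassert>
#include <optional>

// Assignment from a temporary optional: the form for which MSVC 19.32
// emits the stack round trip shown above.
void test_assign(std::optional<double>& f) {
    f = std::optional{42.0};
}

// In-place construction: the form that compiles to two plain stores.
void test_emplace(std::optional<double>& f) {
    f.emplace(42.0);
}

// Hypothetical helper (not in the original): confirms that both forms
// produce an engaged optional holding the same value.
bool both_forms_agree() {
    std::optional<double> a;
    std::optional<double> b;
    test_assign(a);
    test_emplace(b);
    return a.has_value() && b.has_value() && *a == *b;
}
```

Since the two functions are observably equivalent, any difference in the generated code is purely a code-generation matter.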

Replacing std::optional<double> with

struct MyOptional {
    double d;
    bool hasValue; // Required to reproduce the problem
    
    MyOptional(double v) {
        d = v;
    }

    void emplace(double v){
        d = v;
    }
};

exhibits the same issue. Apparently MSVC has trouble with the additional bool member.
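The corresponding test functions for the struct might look like the sketch below. The names test_assign_my and test_emplace_my are hypothetical. Unlike the struct above, this variant sets hasValue, an assumption made here so that copying the object is well defined. Whether this variant yields exactly the same MSVC output has not been checked here.

```cpp
#include <cassert>

// Illustrative variant of MyOptional. Assumption: hasValue is initialized,
// which the struct in the question does not do.
struct MyOptional {
    double d;
    bool hasValue;

    MyOptional(double v) : d(v), hasValue(true) {}

    void emplace(double v) {
        d = v;
        hasValue = true;
    }
};

// Hypothetical counterpart of test_assign: builds a temporary and then
// copy-assigns the whole object.
void test_assign_my(MyOptional& f) {
    f = MyOptional{42.0};
}

// Hypothetical counterpart of test_emplace: writes the members in place.
void test_emplace_my(MyOptional& f) {
    f.emplace(42.0);
}
```

As with std::optional<double>, both functions leave the object in the same state, so a compiler is free to emit the same stores for each.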

See godbolt for a live example.

Why is MSVC producing these unaligned moves? To be clear, the question is not why they are unaligned rather than aligned (which wouldn't improve things according to this post), but why MSVC produces a considerably more expensive sequence of instructions in the assignment case. Is this simply a bug (or a missed optimization opportunity) in MSVC? Or am I missing something?

Centaur answered 25/6, 2022 at 16:9 Comment(8)
Reading the godbolt code, this is due to the use of that struct. Still odd, though. - Boatload
Looks to me like it is trying to write-combine the bool and the double into a single vector op. One of those "compiler is trying to be clever" mis-optimizations. - Maria
@user17732522 Oops, sorry. - Athwartships
#42697618 - Pickle
@user17732522 Yes, sorry, I just fixed it. Besides this, I don't think that the answer to the other question addresses the problem here. The other answer basically says "unaligned loads/stores do not cost anything compared to aligned loads/stores". But in the case here the compiler generates a bunch of additional instructions (regardless of whether they are unaligned) that are unnecessary in the first place (as shown by clang). And the additional instructions do cost performance. - Centaur
If it had put the entire 16-byte std::optional<double> in static memory, and copied it with vmovups xmm1, [static_constant] / vmovups [rcx], xmm1 it might not have been so bad. But instead it puts only the double constant in static memory, constructs the std::optional<double> on the stack, and then copies it to its destination. - Vipul
@NateEldredge: Actually it doesn't ever copy the double to the stack: it stores a 1 with a byte store, then does a 16-byte load, then replaces the low 8 bytes of the XMM register with a movsd register blend. So storing to the stack and reloading was just a way to get a 1 into byte 8 of an XMM register, with the rest being don't-care garbage. That takes 2 instructions and creates a store-forwarding stall, which is horrible compared to vpcmpeqd xmm1,xmm1,xmm1 / vpabsd xmm1, xmm1. And then obviously if you want to merge a new low half from memory, use movlps, not a movsd load + movsd xmm,xmm. - Danas
Anyway, yeah, MSVC is fairly widely considered / known not to be as good an optimizing compiler as clang or GCC. The problem isn't that the vmovups instructions aren't vmovaps; the address likely is 16-byte aligned, since it knows the incoming stack alignment. The problem is the store-forwarding stall from narrow store, wide reload! (GCC isn't immune from shooting itself in the foot, too, though: Bubble sort slower with -O3 than -O2 with GCC) - Danas
