Why did CRITICAL_SECTION performance become worse on Win8
It seems that CRITICAL_SECTION performance became worse on Windows 8 and higher (see the graph below).

The test is pretty simple: some concurrent threads do 3 million locks each to access a variable exclusively. You can find the C++ program at the bottom of the question. I run the test on Windows Vista, Windows 7, Windows 8, Windows 10 (x64, VMWare, Intel Core i7-2600 3.40GHz).

The results are on the image below. The X-axis is the number of concurrent threads. The Y-axis is the elapsed time in seconds (lower is better).

Test results

What we can see:

  • SRWLock performance is approximately the same on all platforms
  • CriticalSection performance became worse relative to SRWLock on Windows 8 and higher

The question is: can anybody please explain why CRITICAL_SECTION performance became worse on Win8 and higher?


Some notes:

  • The results on real machines are much the same: CS is much worse than std::mutex, std::recursive_mutex, and SRWL on Win8 and higher. However, I have no way to run the test on different OSes with the same CPU.
  • The std::mutex implementation for Windows Vista is based on CRITICAL_SECTION, but for Win7 and higher std::mutex is based on SRWL. This holds for both MSVS 2017 and 2015 (to verify, search for the primitives.h file in the MSVC++ installation and look for the stl_critical_section_vista and stl_critical_section_win7 classes). This explains the difference between std::mutex performance on Windows Vista and the other systems.
  • As noted in the comments, std::mutex is a wrapper, so a possible explanation for some overhead relative to SRWL is the overhead introduced by the wrapper code.

#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

#include <Windows.h>

const size_t T = 10;
const size_t N = 3000000;
volatile uint64_t var = 0;

const std::string sep = ";";

namespace WinApi
{
    class CriticalSection
    {
        CRITICAL_SECTION cs;
    public:
        CriticalSection() { InitializeCriticalSection(&cs); }
        ~CriticalSection() { DeleteCriticalSection(&cs); }
        void lock() { EnterCriticalSection(&cs); }
        void unlock() { LeaveCriticalSection(&cs); }
    };

    class SRWLock
    {
        SRWLOCK srw;
    public:
        SRWLock() { InitializeSRWLock(&srw); }
        void lock() { AcquireSRWLockExclusive(&srw); }
        void unlock() { ReleaseSRWLockExclusive(&srw); }
    };
}

template <class M>
void doLock(void *param)
{
    M &m = *static_cast<M*>(param);
    for (size_t n = 0; n < N; ++n)
    {
        m.lock();
        var += std::rand();
        m.unlock();
    }
}

template <class M>
void runTest(size_t threadCount)
{
    M m;
    std::vector<std::thread> thrs(threadCount);

    const auto start = std::chrono::system_clock::now();

    for (auto &t : thrs) t = std::thread(doLock<M>, &m);
    for (auto &t : thrs) t.join();

    const auto end = std::chrono::system_clock::now();

    const std::chrono::duration<double> diff = end - start;
    std::cout << diff.count() << sep;
}

template <class ...Args>
void runTests(size_t threadMax)
{
    {
        int dummy[] = { (std::cout << typeid(Args).name() << sep, 0)... };
        (void)dummy;
    }
    std::cout << std::endl;

    for (size_t n = 1; n <= threadMax; ++n)
    {
        {
            int dummy[] = { (runTest<Args>(n), 0)... };
            (void)dummy;
        }
        std::cout << std::endl;
    }
}

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    runTests<std::mutex, WinApi::CriticalSection, WinApi::SRWLock>(T);
    return 0;
}

The test project was built as a Windows Console Application in Microsoft Visual Studio 2017 (15.8.2) with the following settings:

  • Use of MFC: Use MFC in a Static Library
  • Windows SDK Version: 10.0.17134.0
  • Platform Toolset: Visual Studio 2017 (v141)
  • Optimization: O2, Oi, Oy-, GL
Morley answered 4/9, 2018 at 16:38 Comment(26)
There are some differences in semantics between SRWLock and Critical Section, have a read of: #3499298Thitherto
I had a quick look at std::mutex implementation in my environment (Win7, VS2015) -- there is one layer of indirection on top of whatever OS primitive chosen by std::mutex (see _Mtx_storage + _Mtx_init_in_situ/etc functions used to operate the primitive). This may explain some of observed performance reduction.Ashworth
Your use of std::rand makes me worried about thread safety.First
@Yakk-AdamNevraumont what’s wrong with it?Morley
"...It is implementation-defined whether rand() is thread-safe....": en.cppreference.com/w/cpp/numeric/random/rand It maintains an internal state, what happens if 2 threads try and simultaneously mutate this state.Thitherto
@Morley It is shared state between threads with semantics undefined by the C++ standard. In code attempting to profile multi-threaded performance.First
Do you understand that finding why all this happens will require quite some time? I mean the graphs are cool and all but I doubt anyone can answer such a question offhand, it requires thorough investigation.Wallen
You might want to also profile the non-lock struct NoLock{void lock() {} void unlock() {}}; -- see how much of the cost is from rand() and how much from locking.First
The internal implementation of SRW locks and CS changed from one Windows version to another. CS is more complex and visibly contains more code/checks than SRW. On the other hand, the code inside your "crit sec" is too small. Try doing, say, SwitchToThread() or CreateFileW+CloseHandle — some more time/work inside the critical region — and compare the difference in that case. SRW will still be faster, but not by as much. The mutex is a shell over SRW, so it will always be a bit slower than it; but Vista may have another implementation, which can explain the differenceHostetter
@Yakk-AdamNevraumont In general the standard says that rand is not thread safe. Do you mean the rand() call under the locked mutex is not thread-safe enough? Anyway I did experiments with “var += 1” and others, and the results are the same.Morley
Regarding the discussion surrounding srand/rand, visual c++ uses a thread-local random seed so having two threads executing rand() at the same time will not interfere with each other. But also note this means that each thread needs to call srand to initialize the RNG.Apodal
@Apodal thanks, but the goal of the test is not to randomly increase the variable; the goal is time measurement. So if any srand misuse occurs here, it doesn't affect the test.Morley
@Apodal That sounds better than I feared; I was worried about possible contention.First
For reliable results you should probably run the tests on real devices. Maybe you are measuring certain aspects of VMWare performance more than anything else. @ixs: That's understood. Some questions are harder to answer or take more time than others.Sequent
@Hostetter I tried inserting a std::this_thread::yield() call under the locked mutex. On Win10 the results for std::mutex and SRWL are much the same, but CS is still worse than std::mutex by 10-25%, depending on the number of threads.Morley
@Morley - this is anticipated because CS is more complex. std::this_thread::yield() is also too small a job. Try for example if (HANDLE hEvent = CreateEvent(0,0,0,0)) { CloseHandle(hEvent); }Hostetter
I updated the question with some explanations, so the questions 1 and 2 seem answered now. The question 3 is still waiting for an answer.Morley
@Sequent The results on real machines are much the same: CS is much worse than std::mutex, std::recursive_mutex, and SRWL on Win8 and higher. However, I have no way to run the test on different OSes with the same CPU. I'd publish the results here, but then anyone could start talking about CPU differences, etc. So I guess publishing results from real machines doesn't make sense.Morley
Critical sections are optimized for low contention scenarios. Yours is the exact opposite: Guaranteed continuous contention.Gerdagerdeen
@RaymondChen Yes, but this doesn’t explain why CS became worse on Win8.Morley
That I cannot explain. Just pointing out that you are using critical sections in a way they were not optimized for.Gerdagerdeen
Meltdown-Spectre patch ? Could you test on AMD also ?Scandinavian
Though std::mutex is a wrapper on SRWL, it may perform worse due to its ability to fall back to an implementation that does not use SRWL. Calls to the implementation go through pointers-to-function, and the runtime library is compiled with security options enabled, so Control Flow Guard chimes in.Cropper
@AlexanderGutenev The question was about CriticalSection, not std::mutex.Morley
I see, I just explained an observation from Some notes below question.Cropper
@AlexanderGutenev Ah, OK, thanks!Morley
See Windows Critical Section - how to disable spinning completely. Starting with Windows 8, Microsoft changed the default behavior of Critical Section (without even a word in the documentation): if you use InitializeCriticalSection(&cs), you get spinning with an undocumented dynamic spin-adjustment algorithm enabled. See my comment here: https://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/#comment-57420

For your test, try using InitializeCriticalSectionAndSpinCount(&cs,1) instead of InitializeCriticalSection(&cs). This should make it behave somewhat similar to Windows 7, though there are plenty of other changes in that area.

Caltanissetta answered 1/3, 2019 at 17:18 Comment(5)
What are the other changes in that area you refer to? I know there were a lot of changes throughout history, like adding keyed events or changing from a fair algorithm to an unfair one, but I don't know of any other changes between Windows 7 and Windows 10 except this automatic spin.Cropper
Actually, when I use InitializeCriticalSectionAndSpinCount(&cs,1) the result is somewhere between std::mutex and CriticalSection, but still much closer to CriticalSection. So your explanation doesn't look like the root cause.Morley
@Morley any luck solving the mystery? Maybe mark this as the answer. It sheds quite a lot of light on the case, even if it doesn't explain it 100%.Hundred
@Hundred As I mentioned before, this answer doesn't seem like the root cause. The most likely reason is that there was some problem in Windows updates. As far as I know (but I'm not sure), the problem can't be reproduced now with all the latest updates installed.Morley
@Morley It's actually quite a likely root cause for the benchmark results observed. There is a real shared cache line in this example, so any algorithm that doesn't spin, and instead re-schedules with a delay, avoids several future cache misses because the contention is temporarily gone. And the cost of re-scheduling has dropped a lot compared to Vista, so burning memory transactions on spinning is no longer worth it at all.Pickford