std::mutex performance compared to win32 CRITICAL_SECTION
Asked Answered

6

49

How does the performance of std::mutex compare to CRITICAL_SECTION? Is it on par?

I need a lightweight synchronization object (it doesn't need to be an interprocess object). Is there any standard library class close to CRITICAL_SECTION other than std::mutex?
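
For reference, the kind of usage I have in mind is plain scoped locking of some small shared state. A sketch of what I mean (the names here are made up for illustration, not from any real code):

#include <mutex>

std::mutex g_lock;   // the standard library's closest equivalent to a CRITICAL_SECTION
long g_counter = 0;  // hypothetical shared state

void increment()
{
    // lock_guard locks on construction and unlocks on scope exit,
    // much like an EnterCriticalSection/LeaveCriticalSection pair.
    std::lock_guard<std::mutex> guard(g_lock);
    ++g_counter;
}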

Baronetcy answered 3/4, 2012 at 16:26 Comment(7)
Pedantry: std::mutex is not an STL class. Neither is any other synchronization object.Influx
@uray: no, it is a part of the C++11 standard library. STL is a sort of ambiguous term, but it is usually taken to refer to the collections, algorithms, functors and iterators in the standard library. std::mutex, memcpy or std::ofstream, for example, are not typically considered part of the STLBellerophon
OK, sorry, my reply was already deleted after your reply.Baronetcy
Please read this discussion: https://mcmap.net/q/73791/-what-39-s-the-difference-between-quot-stl-quot-and-quot-c-standard-library-quot/10077Unparliamentary
If you're worried about mutex performance, you're doing something very wrong. Also: what kind of synchronization do you need?Metallurgy
@zvrba: I'm not worried, I just want to use the best available library for a simple synchronization object.Baronetcy
See also https://mcmap.net/q/356605/-why-did-critical_section-performance-become-worse-on-win8/717732 for some 5+ years later infoLanneret
54

Please see my updates at the end of the answer; the situation has changed dramatically since Visual Studio 2015. The original answer is below.

I made a very simple test, and according to my measurements std::mutex is around 50-70x slower than CRITICAL_SECTION.

std::mutex:       18140574us
CRITICAL_SECTION: 296874us

Edit: After some more tests it turned out that it depends on the number of threads (contention) and the number of CPU cores. Generally, std::mutex is slower, but how much slower depends on usage. Following are the updated test results (tested on a MacBook Pro with a Core i5-4258U, Windows 10, Boot Camp):

Iterations: 1000000
Thread count: 1
std::mutex:       78132us
CRITICAL_SECTION: 31252us
Thread count: 2
std::mutex:       687538us
CRITICAL_SECTION: 140648us
Thread count: 4
std::mutex:       1031277us
CRITICAL_SECTION: 703180us
Thread count: 8
std::mutex:       86779418us
CRITICAL_SECTION: 1634123us
Thread count: 16
std::mutex:       172916124us
CRITICAL_SECTION: 3390895us

Following is the code that produced this output, compiled with Visual Studio 2012, default project settings, Win32 release configuration. Please note that this test may not be perfectly correct, but it made me think twice before switching my code from CRITICAL_SECTION to std::mutex.

#include "stdafx.h"
#include <Windows.h>
#include <mutex>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

const int g_cRepeatCount = 1000000;
const int g_cThreadCount = 16;

double g_shmem = 8;
std::mutex g_mutex;
CRITICAL_SECTION g_critSec;

void sharedFunc( int i )
{
    if ( i % 2 == 0 )
        g_shmem = sqrt(g_shmem);
    else
        g_shmem *= g_shmem;
}

void threadFuncCritSec() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        EnterCriticalSection( &g_critSec );
        sharedFunc(i);
        LeaveCriticalSection( &g_critSec );
    }
}

void threadFuncMutex() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        g_mutex.lock();
        sharedFunc(i);
        g_mutex.unlock();
    }
}

void testRound(int threadCount)
{
    std::vector<std::thread> threads;

    auto startMutex = std::chrono::high_resolution_clock::now();
    for (int i = 0; i<threadCount; ++i)
        threads.push_back(std::thread( threadFuncMutex ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endMutex = std::chrono::high_resolution_clock::now();

    std::cout << "std::mutex:       ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endMutex - startMutex).count();
    std::cout << "us \n\r";

    threads.clear();
    auto startCritSec = std::chrono::high_resolution_clock::now();
    for (int i = 0; i<threadCount; ++i)
        threads.push_back(std::thread( threadFuncCritSec ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endCritSec = std::chrono::high_resolution_clock::now();

    std::cout << "CRITICAL_SECTION: ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endCritSec - startCritSec).count();
    std::cout << "us \n\r";
}

int _tmain(int argc, _TCHAR* argv[]) {
    InitializeCriticalSection( &g_critSec );

    std::cout << "Iterations: " << g_cRepeatCount << "\n\r";

    for (int i = 1; i <= g_cThreadCount; i = i*2) {
        std::cout << "Thread count: " << i << "\n\r";
        testRound(i);
        Sleep(1000);
    }

    DeleteCriticalSection( &g_critSec );

    // Added 10/27/2017 to try to prevent the compiler to completely
    // optimize out the code around g_shmem if it wouldn't be used anywhere.
    std::cout << "Shared variable value: " << g_shmem << std::endl;
    getchar();
    return 0;
}

Update 10/27/2017 (1): Some answers suggest that this is not a realistic test or that it does not represent a "real world" scenario. That's true: this test tries to measure the overhead of std::mutex; it's not trying to prove that the difference is negligible for 99% of applications.

Update 10/27/2017 (2): It seems the situation has changed in favor of std::mutex since Visual Studio 2015 (VC140). I used the VS2017 IDE, exactly the same code as above, x64 release configuration with optimizations disabled, and simply switched the "Platform Toolset" for each test. The results are very surprising and I am really curious what changed in VC140.

[Chart: performance with 8 threads, by platform toolset]

Update 02/25/2020 (3): I reran the test with Visual Studio 2019 (toolset v142), and the situation is still the same: std::mutex is two to three times faster than CRITICAL_SECTION.
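
For reference, the comments below point out that since VC140 the std::mutex implementation is built on SRW locks. As an illustration only (this is a sketch, not the library's actual code), a minimal SRWLOCK-based lock that provides the lock/try_lock/unlock interface used by std::lock_guard could look like this:

#include <Windows.h>

// Sketch only -- not the MSVC STL implementation. A non-recursive lock
// built on the Win32 slim reader/writer lock (try_lock needs Windows 7+).
class SrwLock {
public:
    SrwLock() { InitializeSRWLock(&m_lock); }
    SrwLock(const SrwLock&) = delete;
    SrwLock& operator=(const SrwLock&) = delete;

    void lock()     { AcquireSRWLockExclusive(&m_lock); }
    bool try_lock() { return TryAcquireSRWLockExclusive(&m_lock) != 0; }
    void unlock()   { ReleaseSRWLockExclusive(&m_lock); }

private:
    SRWLOCK m_lock;
};

// Usage: std::lock_guard<SrwLock> guard(someSrwLock);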

Dinin answered 28/6, 2014 at 20:55 Comment(12)
Hopefully we can make this better :)Prevaricate
I have used your code below for benchmarking (I know, as the other answer pointed out, it's heavy on contention, but I'm interested in such a bench for some other discussion). Some nitpicks on your code: 1. Windows.h should be windows.h, 2. there's no need to use _tmain, which isn't even standard, and 3. you can remove the arguments from main too, since you don't actually use them (alternatively: use char** instead of the non-standard _TCHAR*).Beacon
std::mutex is implemented in VC 2015 using an SRW lock.Ouch
@Beacon 1. Why should "windows.h" be written in lower case? Microsoft always refers to this as "Windows.h", e.g. msdn.microsoft.com/en-us/library/windows/desktop/ms682608.aspx. 2. The use of _tmain() makes the code "Unicode agile", which is useful. It is non-standard but so is the rest of the code. 3. Use "char *argv[]" instead of "char **argv".Sonatina
@Sonatina the file is lower-case, and writing it differently is going to cause compilation errors on most file systems (e.g. ext2/3/4, btrfs, xfs, reiserfs, et cetera). I know because I've experienced this exact problem while testing the code from the answer. And what does "Unicode agile" mean?Beacon
@Beacon In all the Windows SDKs I currently have installed (7.1, 8.0, 8.1, 10.0) the filename has an uppercase "W". But on Windows (with default settings) case does not matter anyway. "Unicode agile" means you can build your application using ANSI/CodePage or UTF-16 (aka Unicode on Windows) characters. Which means the "_tmain" is translated to: UNICODE not defined: "int main(int argc, char *argv[])" or UNICODE defined: "int wmain(int argc, wchar_t *argv[]"Sonatina
@Sonatina you do understand that, even if we're only talking about the UTF-16 encoding of Unicode, using wchar_t is pointless, don't you? Unicode encoding is a very complicated matter, starting with the basic fact that the 16 in UTF-16 just means "a character takes at least 2 bytes", and ending with complicated matters like combining characters. You can't just peek at a multiple-of-2 offset and be sure you're alright. Also, FYI, an interesting article.Beacon
@Beacon Yes I'm (painfully) aware that it takes a lot more than just switching character types. And using MBCS or UTF-8 also uses multiple bytes, so for us it has never been valid to use just plain byte offsets. But if you do it right, with some help from "Unicode-agile" string classes, you can switch between Unicode and non-Unicode version of your application at compile time. But maintaining such a duality is additional work so you might not want to, and we should (at least for new applications) focus on making a Unicode-version.Sonatina
Using wchar_t is by no means pointless, it is the "native" type for Unicode APIs on Windows. It is by far the best option if your app is Windows only. Making such code portable to UTF-8 platforms like Linux can be painful, and a 32-bit wchar_t makes it even worse.Sonatina
It would be interesting to see boost::mutex in the same test.Donnelly
See also https://mcmap.net/q/356605/-why-did-critical_section-performance-become-worse-on-win8/717732 - it seems that recent implementations of CRITICAL_SECTION do some spinning when initialized without the magic parameterLanneret
The most likely reason for this change was dropping support for pre-Win7 systems in the internal implementation. The Windows 7 implementation uses an SRWLock; the Windows Vista version used a CRITICAL_SECTION. There also used to be an implementation using Concurrency Runtime critical sections; not sure when that was removed.Watercress
25

The test by waldez here is not realistic; it basically simulates 100% contention. In general, that is exactly what you don't want in multi-threaded code. Below is a modified test in which each thread does some independent computation and only briefly touches the shared state under the lock. The results I get with this code are different:

Tasks: 160000
Thread count: 1
std::mutex:       12096ms
CRITICAL_SECTION: 12060ms
Thread count: 2
std::mutex:       5206ms
CRITICAL_SECTION: 5110ms
Thread count: 4
std::mutex:       2643ms
CRITICAL_SECTION: 2625ms
Thread count: 8
std::mutex:       1632ms
CRITICAL_SECTION: 1702ms
Thread count: 12
std::mutex:       1227ms
CRITICAL_SECTION: 1244ms

You can see here that for me (using VS2013) the figures are very close between std::mutex and CRITICAL_SECTION. Note that this code does a fixed number of tasks (160,000), which is why the performance generally improves with more threads. I've got 12 cores here, so that's why I stopped at 12.

I'm not saying this is right or wrong compared to the other test, but it does highlight that timing results are generally domain-specific.

#include "stdafx.h"
#include <Windows.h>
#include <mutex>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

const int taskCount = 160000;
int numThreads;
const int MAX_THREADS = 16;

double g_shmem = 8;
std::mutex g_mutex;
CRITICAL_SECTION g_critSec;

void sharedFunc(int i, double &data)
{
    for (int j = 0; j < 100; j++)
    {
        if (j % 2 == 0)
            data = sqrt(data);
        else
            data *= data;
    }
}

void threadFuncCritSec() {
    double lMem = 8;
    int iterations = taskCount / numThreads;
    for (int i = 0; i < iterations; ++i) {
        for (int j = 0; j < 100; j++)
            sharedFunc(j, lMem);
        EnterCriticalSection(&g_critSec);
        sharedFunc(i, g_shmem);
        LeaveCriticalSection(&g_critSec);
    }
    printf("results: %f\n", lMem);
}

void threadFuncMutex() {
    double lMem = 8;
    int iterations = taskCount / numThreads;
    for (int i = 0; i < iterations; ++i) {
        for (int j = 0; j < 100; j++)
            sharedFunc(j, lMem);
        g_mutex.lock();
        sharedFunc(i, g_shmem);
        g_mutex.unlock();
    }
    printf("results: %f\n", lMem);
}

void testRound()
{
    std::vector<std::thread> threads;

    auto startMutex = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < numThreads; ++i)
        threads.push_back(std::thread(threadFuncMutex));
    for (std::thread& thd : threads)
        thd.join();
    auto endMutex = std::chrono::high_resolution_clock::now();

    std::cout << "std::mutex:       ";
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(endMutex - startMutex).count();
    std::cout << "ms \n\r";

    threads.clear();
    auto startCritSec = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < numThreads; ++i)
        threads.push_back(std::thread(threadFuncCritSec));
    for (std::thread& thd : threads)
        thd.join();
    auto endCritSec = std::chrono::high_resolution_clock::now();

    std::cout << "CRITICAL_SECTION: ";
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(endCritSec - startCritSec).count();
    std::cout << "ms \n\r";
}

int _tmain(int argc, _TCHAR* argv[]) {
    InitializeCriticalSection(&g_critSec);

    std::cout << "Tasks: " << tastCount << "\n\r";

    for (numThreads = 1; numThreads <= MAX_THREADS; numThreads = numThreads * 2) {
        if (numThreads == 16)
            numThreads = 12;
        Sleep(100);
        std::cout << "Thread count: " << numThreads << "\n\r";
        testRound();
    }

    DeleteCriticalSection(&g_critSec);
    return 0;
}
Footlight answered 23/2, 2015 at 11:46 Comment(1)
When there is no contention you will get identical results. If not, then the synchronization primitives you use are broken, or the measurements are simply inaccurate.Donnelly
2

Original answer from February 2015:

I'm using Visual Studio 2013.

My results for single-threaded usage look similar to waldez's results:

1 million lock/unlock calls:

CRITICAL_SECTION:       19 ms
std::mutex:             48 ms
std::recursive_mutex:   48 ms

The reason why Microsoft changed the implementation is C++11 compatibility. C++11 has 4 kinds of mutexes in the std namespace: std::mutex, std::recursive_mutex, std::timed_mutex and std::recursive_timed_mutex.
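
For reference, a tiny usage sketch of the timed variant (plain standard C++11, shown only to illustrate the extra interface the wrapper has to provide on top of a critical section):

#include <mutex>
#include <chrono>

std::timed_mutex g_timedMutex;

bool tryTimedWork()
{
    // try_lock_for is the operation a raw CRITICAL_SECTION cannot provide
    // directly: TryEnterCriticalSection has no timeout parameter.
    if (g_timedMutex.try_lock_for(std::chrono::milliseconds(10))) {
        // ... protected work ...
        g_timedMutex.unlock();
        return true;
    }
    return false;
}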

Microsoft's std::mutex and all the other mutexes are wrappers around a critical section:

    struct _Mtx_internal_imp_t
    {   /* Win32 mutex */
        int type; // here MS keeps particular mutex type
        Concurrency::critical_section cs;
        long thread_id;
        int count;
    };

In my view, std::recursive_mutex should match a critical section exactly, so Microsoft should optimize its implementation to take less CPU and memory.
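
To illustrate that point, here is a minimal sketch (my own, not Microsoft's code) of a recursive lock that simply wraps a Win32 CRITICAL_SECTION and exposes the lock/try_lock/unlock interface, with no extra data members:

#include <Windows.h>

// Sketch only: a CRITICAL_SECTION is already recursive, so a
// recursive_mutex-like type needs nothing beyond the raw Win32 object.
class CritSecMutex {
public:
    CritSecMutex()  { InitializeCriticalSection(&m_cs); }
    ~CritSecMutex() { DeleteCriticalSection(&m_cs); }
    CritSecMutex(const CritSecMutex&) = delete;
    CritSecMutex& operator=(const CritSecMutex&) = delete;

    void lock()     { EnterCriticalSection(&m_cs); }
    bool try_lock() { return TryEnterCriticalSection(&m_cs) != 0; }
    void unlock()   { LeaveCriticalSection(&m_cs); }

private:
    CRITICAL_SECTION m_cs;
};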

Update from February 2023:

A fresh investigation shows a difference in the std::mutex implementation in the latest versions of MSVC compared to MSVC 2013. I tried the following compilers/STLs and they showed the same behavior:

  • MSVC 2019 (SDK 14.29.30133)
  • MSVC 2022 (SDK 14.33.31629)

Both of them use SRW locks for the std::mutex implementation by default.

However, the CRT can still choose the CRITICAL_SECTION-based implementation at runtime.

Here is the modern underlying structure definition:

struct _Mtx_internal_imp_t { // ConcRT mutex
    int type;
    std::aligned_storage_t<Concurrency::details::stl_critical_section_max_size,
        Concurrency::details::stl_critical_section_max_alignment>
        cs;
    long thread_id;
    int count;
    Concurrency::details::stl_critical_section_interface* _get_cs() { // get pointer to implementation
        return reinterpret_cast<Concurrency::details::stl_critical_section_interface*>(&cs);
    }
};

And this is how it is initialized:

void _Mtx_init_in_situ(_Mtx_t mtx, int type) { // initialize mutex in situ
    Concurrency::details::create_stl_critical_section(mtx->_get_cs());
    mtx->thread_id = -1;
    mtx->type      = type;
    mtx->count     = 0;
}

inline void create_stl_critical_section(stl_critical_section_interface* p) {
#ifdef _CRT_WINDOWS
    new (p) stl_critical_section_win7;
#else
    switch (__stl_sync_api_impl_mode) {
    case __stl_sync_api_modes_enum::normal:
    case __stl_sync_api_modes_enum::win7:
        if (are_win7_sync_apis_available()) {
            new (p) stl_critical_section_win7;
            return;
        }
        // fall through
    case __stl_sync_api_modes_enum::vista:
        new (p) stl_critical_section_vista;
        return;
    default:
        abort();
    }
#endif // _CRT_WINDOWS
}

are_win7_sync_apis_available checks for the existence of the API function TryAcquireSRWLockExclusive at runtime.

As you can see, create_stl_critical_section will choose stl_critical_section_vista if it runs on Windows Vista, for example.
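
The check itself boils down to a GetProcAddress lookup. Roughly (this is an assumed illustration, not the actual CRT source), something like:

#include <Windows.h>

// Assumed illustration, not the real CRT code: if kernel32 exports
// TryAcquireSRWLockExclusive (Windows 7+), the SRW-lock based
// implementation can be used; otherwise fall back to the Vista path.
static bool win7_sync_apis_available()
{
    HMODULE kernel32 = GetModuleHandleW(L"kernel32.dll");
    return kernel32 != nullptr
        && GetProcAddress(kernel32, "TryAcquireSRWLockExclusive") != nullptr;
}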

We can also force the CRT to choose the CRITICAL_SECTION-based implementation by calling the undocumented function __set_stl_sync_api_mode:

#include <mutex>

enum class __stl_sync_api_modes_enum { normal, win7, vista, concrt };
extern "C" _CRTIMP2 void __cdecl __set_stl_sync_api_mode(__stl_sync_api_modes_enum mode);

int main()
{
    __set_stl_sync_api_mode(__stl_sync_api_modes_enum::vista);
    std::mutex protect; // now it is forced to use CRITICAL_SECTION inside
}

This works both for dynamic CRT linking (DLL) and for the static CRT, but debugging the static CRT is much easier (in debug mode).

Heaviness answered 19/2, 2015 at 12:32 Comment(8)
Win32 critical sections are recursive by design, so it makes sense to have both mutex and recursive_mutex use the same implementation (if a non-recursive mutex is recursive anyway, this doesn't really hurt). However, there's no way you can implement timed_mutex in terms of a critical section (because there is no such thing as TryEnterCriticalSection with a timeout parameter). Unless the implementation is noncompliant, they must use keyed events or Win32 mutex objects.Makassar
Maybe it was not clear from my first message, but I wrote the same thing. A Windows critical section fully matches std::recursive_mutex (and only this mutex). So std::recursive_mutex in particular can be implemented without additional data and logic on top of the Windows API. I also don't see any reason to keep the type of the mutex in memory as a data member; the type is known at compile time and could be a template argument, not a class member.Heaviness
The likely reason is that the standard is worded in a pretty piss-poor way. It doesn't state that a recursive mutex must also support being called recursively, but it states that a non-recursive mutex must not support it (and will throw resource_deadlock_would_occur). It's a precondition that the calling thread does not own the mutex. So while the compiler can use the same underlying implementation (critical section), to be standards-compliant, it must add extra code (and a type member) to comply with this nonsense even though it's completely harmless. This should actually be "unspecified".Makassar
You are talking about std::mutex (non-recursive). And I agree the current MS implementation needs that wrapper around the critical section for std::mutex. But we don't need that wrapper for std::recursive_mutex!Heaviness
Agreed, we need to be better here.Prevaricate
@Damon: The standard does not mandate that we throw resource_deadlock_would_occur. See 30.4.1.2 [thread.mutex.requirements.mutex]/7: Requires: If m is of type std::mutex, std::timed_mutex, std::shared_mutex, or std::shared_timed_mutex, the calling thread does not own the mutex. and 30.4.1.2.1 [thread.mutex.class]/4: [ Note: A program may deadlock if the thread that owns a mutex object calls lock() on that object. If the implementation can detect the deadlock, a resource_deadlock_would_occur error condition may be observed. —end note ]Prevaricate
Concurrency::critical_section is not a win32 CRITICAL_SECTION. It is not recursive (and throws an exception if you attempt to lock it recursively), has a try_lock_for function, and has a more cooperative underlying implementation. Also while this might indeed have been the implementation in 2018, std::mutex is now implemented with SRWLock in Microsoft's STL.Watercress
@nfries88, I investigated and you are mostly right. As of Feb 2023, the MSVC 2019 and 2022 STL uses SRW locks by default and switches to the CRITICAL_SECTION implementation only if run on an OS earlier than Windows 7. This works both for the static and the dynamic CRT. We can also switch back to CRITICAL_SECTION by calling the undocumented CRT function __set_stl_sync_api_mode.Heaviness
2

I was searching here for pthread vs. critical section benchmarks; however, as my results turned out to be different from waldez's answer on this topic, I thought it'd be interesting to share.

The code is the one used by @waldez, modified to add pthreads to the comparison, compiled with GCC and no optimizations. My CPU is AMD A8-3530MX.

Windows 7 Home Edition:

>a.exe
Iterations: 1000000
Thread count: 1
std::mutex:       46800us
CRITICAL_SECTION: 31200us
pthreads:         31200us
Thread count: 2
std::mutex:       171600us
CRITICAL_SECTION: 218400us
pthreads:         124800us
Thread count: 4
std::mutex:       327600us
CRITICAL_SECTION: 374400us
pthreads:         249600us
Thread count: 8
std::mutex:       967201us
CRITICAL_SECTION: 748801us
pthreads:         717601us
Thread count: 16
std::mutex:       2745604us
CRITICAL_SECTION: 1497602us
pthreads:         1903203us

As you can see, the difference stays largely within statistical error — sometimes std::mutex is faster, sometimes it's not. What's important is that I do not observe as big a difference as in the original answer.

I think the reason may be that when the answer was posted, the MSVC compiler wasn't yet good with the newer standards; note that the original answer used the 2012 version.

Also, out of curiosity, the same binary under Wine on Arch Linux:

$ wine a.exe
fixme:winediag:start_process Wine Staging 2.19 is a testing version containing experimental patches.
fixme:winediag:start_process Please mention your exact version when filing bug reports on winehq.org.
Iterations: 1000000
Thread count: 1
std::mutex:       53810us 
CRITICAL_SECTION: 95165us 
pthreads:         62316us 
Thread count: 2
std::mutex:       604418us 
CRITICAL_SECTION: 1192601us 
pthreads:         688960us 
Thread count: 4
std::mutex:       779817us 
CRITICAL_SECTION: 2476287us 
pthreads:         818022us 
Thread count: 8
std::mutex:       1806607us 
CRITICAL_SECTION: 7246986us 
pthreads:         809566us 
Thread count: 16
std::mutex:       2987472us 
CRITICAL_SECTION: 14740350us 
pthreads:         1453991us

waldez's code with my modifications:

#include <math.h>
#include <windows.h>
#include <mutex>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
#include <pthread.h>

const int g_cRepeatCount = 1000000;
const int g_cThreadCount = 16;

double g_shmem = 8;
std::mutex g_mutex;
CRITICAL_SECTION g_critSec;
pthread_mutex_t pt_mutex;


void sharedFunc( int i )
{
    if ( i % 2 == 0 )
        g_shmem = sqrt(g_shmem);
    else
        g_shmem *= g_shmem;
}

void threadFuncCritSec() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        EnterCriticalSection( &g_critSec );
        sharedFunc(i);
        LeaveCriticalSection( &g_critSec );
    }
}

void threadFuncMutex() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        g_mutex.lock();
        sharedFunc(i);
        g_mutex.unlock();
    }
}

void threadFuncPTMutex() {
    for ( int i = 0; i < g_cRepeatCount; ++i ) {
        pthread_mutex_lock(&pt_mutex);
        sharedFunc(i);
        pthread_mutex_unlock(&pt_mutex);
    }
}
void testRound(int threadCount)
{
    std::vector<std::thread> threads;

    auto startMutex = std::chrono::high_resolution_clock::now();
    for (int i = 0; i<threadCount; ++i)
        threads.push_back(std::thread( threadFuncMutex ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endMutex = std::chrono::high_resolution_clock::now();

    std::cout << "std::mutex:       ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endMutex - startMutex).count();
    std::cout << "us \n";
    g_shmem = 0;

    threads.clear();
    auto startCritSec = std::chrono::high_resolution_clock::now();
    for (int i = 0; i<threadCount; ++i)
        threads.push_back(std::thread( threadFuncCritSec ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endCritSec = std::chrono::high_resolution_clock::now();

    std::cout << "CRITICAL_SECTION: ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endCritSec - startCritSec).count();
    std::cout << "us \n";
    g_shmem = 0;

    threads.clear();
    auto startPThread = std::chrono::high_resolution_clock::now();
    for (int i = 0; i<threadCount; ++i)
        threads.push_back(std::thread( threadFuncPTMutex ));
    for ( std::thread& thd : threads )
        thd.join();
    auto endPThread = std::chrono::high_resolution_clock::now();

    std::cout << "pthreads:         ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endPThread - startPThread).count();
    std::cout << "us \n";
    g_shmem = 0;
}

int main() {
    InitializeCriticalSection( &g_critSec );
    pthread_mutex_init(&pt_mutex, 0);

    std::cout << "Iterations: " << g_cRepeatCount << "\n";

    for (int i = 1; i <= g_cThreadCount; i = i*2) {
        std::cout << "Thread count: " << i << "\n";
        testRound(i);
        Sleep(1000);
    }

    getchar();
    DeleteCriticalSection( &g_critSec );
    pthread_mutex_destroy(&pt_mutex);
    return 0;
}
Beacon answered 25/10, 2017 at 23:5 Comment(3)
This test code doesn't seem to accurately reflect what any realistic code would actually do and also doesn't measure the impact of the synchronization code on other threads on the system or how the code acts when not hot in the branch prediction cache.Contexture
@DavidSchwartz I know, see this comment, and the answer. The only reason I posted it just to show that things changed since 2014 (2012?) year.Beacon
@DavidSchwartz admittedly, I could've just posted a comment that nowadays std::mutex and critical sections don't differ in performance even in such a non-realistic case, but, well, people need actual numbers, right :)Beacon
0

The same test program by waldez, modified to run with pthreads and boost::mutex.

On Windows 10 Pro (with an Intel i7-7820X 16-core CPU) I get better results from std::mutex on VS2015 Update 3 (and even better results from boost::mutex) than from CRITICAL_SECTION:

Iterations: 1000000

Thread count: 1
std::mutex:       23403us
boost::mutex:     12574us
CRITICAL_SECTION: 19454us

Thread count: 2
std::mutex:       55031us
boost::mutex:     45263us
CRITICAL_SECTION: 187597us

Thread count: 4
std::mutex:       113964us
boost::mutex:     83699us
CRITICAL_SECTION: 605765us

Thread count: 8
std::mutex:       266091us
boost::mutex:     155265us
CRITICAL_SECTION: 1908491us

Thread count: 16
std::mutex:       633032us
boost::mutex:     300076us
CRITICAL_SECTION: 4015176us

Results for pthreads are here.

#ifdef _WIN32
#include <Windows.h>
#else
#include <pthread.h>   // for the pthread_mutex_t fallback used below
#endif
#include <cmath>       // sqrt
#include <cstdlib>     // rand
#include <mutex>
#include <boost/thread/mutex.hpp>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

const int g_cRepeatCount = 1000000;
const int g_cThreadCount = 16;

double g_shmem = 8;
std::recursive_mutex g_mutex; // note: a recursive mutex, though it is reported as "std::mutex" in the output
boost::mutex g_boostMutex;

void sharedFunc(int i)
{
    if (i % 2 == 0)
        g_shmem = sqrt(g_shmem);
    else
        g_shmem *= g_shmem;
}

#ifdef _WIN32
CRITICAL_SECTION g_critSec;
void threadFuncCritSec()
{
    for (int i = 0; i < g_cRepeatCount; ++i)
    {
        EnterCriticalSection(&g_critSec);
        sharedFunc(i);
        LeaveCriticalSection(&g_critSec);
    }
}
#else
pthread_mutex_t pt_mutex;
void threadFuncPtMutex()
{
    for (int i = 0; i < g_cRepeatCount; ++i) {
        pthread_mutex_lock(&pt_mutex);
        sharedFunc(i);
        pthread_mutex_unlock(&pt_mutex);
    }
}
#endif

void threadFuncMutex()
{
    for (int i = 0; i < g_cRepeatCount; ++i)
    {
        g_mutex.lock();
        sharedFunc(i);
        g_mutex.unlock();
    }
}

void threadFuncBoostMutex()
{
    for (int i = 0; i < g_cRepeatCount; ++i)
    {
        g_boostMutex.lock();
        sharedFunc(i);
        g_boostMutex.unlock();
    }
}

void testRound(int threadCount)
{
    std::vector<std::thread> threads;

    std::cout << "\nThread count: " << threadCount << "\n\r";

    auto startMutex = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread(threadFuncMutex));
    for (std::thread& thd : threads)
        thd.join();
    threads.clear();
    auto endMutex = std::chrono::high_resolution_clock::now();

    std::cout << "std::mutex:       ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endMutex - startMutex).count();
    std::cout << "us \n\r";

    auto startBoostMutex = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread(threadFuncBoostMutex));
    for (std::thread& thd : threads)
        thd.join();
    threads.clear();
    auto endBoostMutex = std::chrono::high_resolution_clock::now();

    std::cout << "boost::mutex:     ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endBoostMutex - startBoostMutex).count();
    std::cout << "us \n\r";

#ifdef _WIN32
    auto startCritSec = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread(threadFuncCritSec));
    for (std::thread& thd : threads)
        thd.join();
    threads.clear();
    auto endCritSec = std::chrono::high_resolution_clock::now();

    std::cout << "CRITICAL_SECTION: ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endCritSec - startCritSec).count();
    std::cout << "us \n\r";
#else
    auto startPThread = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threadCount; ++i)
        threads.push_back(std::thread(threadFuncPtMutex));
    for (std::thread& thd : threads)
        thd.join();
    threads.clear();
    auto endPThread = std::chrono::high_resolution_clock::now();

    std::cout << "pthreads:         ";
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endPThread - startPThread).count();
    std::cout << "us \n";
#endif
}

int main()
{
#ifdef _WIN32
    InitializeCriticalSection(&g_critSec);
#else
    pthread_mutex_init(&pt_mutex, 0);
#endif

    std::cout << "Iterations: " << g_cRepeatCount << "\n\r";

    for (int i = 1; i <= g_cThreadCount; i = i * 2)
    {
        testRound(i);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

#ifdef _WIN32
    DeleteCriticalSection(&g_critSec);
#else
    pthread_mutex_destroy(&pt_mutex);
#endif
    if (rand() % 10000 == 1)
    {
        // Added 10/27/2017 to try to prevent the compiler to completely
        // optimize out the code around g_shmem if it wouldn't be used anywhere.
        std::cout << "Shared variable value: " << g_shmem << std::endl;
    }
    return 0;
}
Donnelly answered 22/5, 2018 at 23:8 Comment(0)
0
My results for test 1 (waldez's test above):

Iterations: 1000000
Thread count: 1
std::mutex:      27085us
CRITICAL_SECTION: 12035us
Thread count: 2
std::mutex:      40412us
CRITICAL_SECTION: 119952us
Thread count: 4
std::mutex:      123214us
CRITICAL_SECTION: 314774us
Thread count: 8
std::mutex:      387737us
CRITICAL_SECTION: 1664506us
Thread count: 16
std::mutex:      836901us
CRITICAL_SECTION: 3837877us
Shared variable value: 8

And for test 2 (Footlight's modified test):

Tasks: 160000
Thread count: 1
results: 8.000000
std::mutex:       4642ms
results: 8.000000
CRITICAL_SECTION: 4588ms
Thread count: 2
results: 8.000000
results: 8.000000
std::mutex:       2309ms
results: 8.000000
results: 8.000000
CRITICAL_SECTION: 2307ms
Thread count: 4
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
std::mutex:       1169ms
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
CRITICAL_SECTION: 1162ms
Thread count: 8
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
std::mutex:       640ms
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
CRITICAL_SECTION: 628ms
Thread count: 12
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
std::mutex:       745ms
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
results: 8.000000
CRITICAL_SECTION: 672ms
Anselmo answered 23/6, 2021 at 8:38 Comment(0)
