Is Critical Section always faster?

I was debugging a multi-threaded application and looked at the internal structure of CRITICAL_SECTION. The data member LockSemaphore of CRITICAL_SECTION struck me as an interesting one.

It looks like LockSemaphore is an auto-reset event (not a semaphore, as the name suggests), and the operating system creates this event silently the first time a thread waits on a critical section that is locked by some other thread.

Now I am wondering: is a critical section always faster? An event is a kernel object, and each critical section object is associated with an event object, so how can a critical section be faster than other kernel objects like a mutex? Also, how does the internal event object actually affect the performance of a critical section?

Here is the structure of the CRITICAL_SECTION:

struct RTL_CRITICAL_SECTION
{
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;  // debug bookkeeping (contention stats, etc.)
    LONG LockCount;          // lock state, manipulated with interlocked ops on the fast path
    LONG RecursionCount;     // how many times the owning thread has re-entered the lock
    HANDLE OwningThread;     // the owner's thread ID (stored as a HANDLE)
    HANDLE LockSemaphore;    // the lazily created auto-reset event used for contended waits
    ULONG_PTR SpinCount;     // iterations to spin before falling back to a kernel wait
};
Ontiveros answered 12/5, 2009 at 15:16 Comment(2)
Also remember the CriticalSection implementation details may vary from version to version of the OS. It can be instructive to look at these details, but don't rely on them, as they may change. (Anthropophagite)
It certainly is a semaphore in NT5. (Lancey)

When they say that a critical section is "fast", they mean "it's cheap to acquire one when it isn't already locked by another thread".

[Note that if it is already locked by another thread, then it doesn't matter nearly so much how fast it is.]

The reason why it's fast is that, before going into the kernel, it uses the equivalent of InterlockedIncrement on one of those LONG fields (perhaps on the LockCount field), and if that succeeds then it considers the lock acquired without ever having gone into the kernel.

The InterlockedIncrement API is, I think, implemented in user mode as a "LOCK INC" opcode; in other words, you can acquire an uncontested critical section without doing any ring transition into the kernel at all.
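
To make that concrete, here is a minimal sketch of the fast path, assuming a single interlocked compare-exchange stands in for the real LockCount manipulation (illustrative only, not the actual NTDLL code):

#include <windows.h>

// 0 = free, 1 = held; a stand-in for the LockCount field.
static volatile LONG g_lockWord = 0;

// Try to take the lock with one interlocked instruction (a LOCK CMPXCHG).
// If it succeeds, the uncontended acquire never left user mode.
bool TryAcquireFastPath()
{
    return InterlockedCompareExchange(&g_lockWord, 1, 0) == 0;
}

void ReleaseFastPath()
{
    // Releasing is likewise a single interlocked store back to 'free'.
    InterlockedExchange(&g_lockWord, 0);
}

A real critical section additionally tracks the owner and recursion count, and falls back to waiting on the kernel event only when the compare-exchange fails.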

Lowman answered 12/5, 2009 at 16:12 Comment(2)
+1 Nice explanation. Effectively you can test for yourself whether you can acquire the critical section by reading LockCount. (Enceladus)
@Arno: there is TryEnterCriticalSection for that. (Jazmin)

In performance work, few things fall into the "always" category :) If you implement something yourself that is similar to an OS critical section using other primitives, the odds are it will be slower in most cases.

The best way to answer your question is with performance measurements. How OS objects perform is very dependent on the scenario. For example, critical sections are generally considered 'fast' if contention is low. They are also considered fast if the lock hold time is shorter than the spin-count time.

The most important thing to determine is whether contention on a critical section is the first-order limiting factor in your application. If not, then simply use a critical section normally and work on your application's primary bottleneck (or bottlenecks).

If critical section performance really is critical, then you can consider the following (a short sketch of items 1 through 3 follows the list).

  1. Carefully set the spin count for your 'hot' critical sections. If performance is paramount, then the work here is worth it. Remember, while the spin lock does avoid the user mode to kernel transition, it consumes CPU time at a furious rate; while spinning, nothing else gets to use that CPU time. If a lock is held for long enough, the spinning thread will actually block, freeing up that CPU to do other work.
  2. If you have a reader/writer pattern then consider using the Slim Reader/Writer (SRW) locks. The downside here is they are only available on Vista and Windows Server 2008 and later products.
  3. You may be able to use condition variables with your critical section to minimize polling and contention, waking threads only when needed. Again, these are supported on Vista and Windows Server 2008 and later products.
  4. Consider using Interlocked Singly Linked Lists (SLIST)- these are efficient and 'lock free'. Even better, they are supported on XP and Windows Server 2003 and later products.
  5. Examine your code - you may be able to break up a 'hot' lock by refactoring some code and using an interlocked operation, or SLIST for synchronization and communication.
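
A short sketch of items 1 through 3 (the API names are the documented Win32 ones, but the snippet itself is only illustrative; the 4000 spin count is the heap-manager figure from MSDN, not a recommendation):

#include <windows.h>

CRITICAL_SECTION   g_cs;                     // item 1: hand-tuned spin count
SRWLOCK            g_srw = SRWLOCK_INIT;     // item 2: slim reader/writer lock
CONDITION_VARIABLE g_cv;                     // item 3: condition variable
bool               g_workReady = false;

void InitLocks()
{
    // Spin roughly 4000 iterations in user mode before blocking in the kernel.
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
    InitializeConditionVariable(&g_cv);
}

void ReaderPath()
{
    AcquireSRWLockShared(&g_srw);            // many readers can hold this at once
    // ... read shared state ...
    ReleaseSRWLockShared(&g_srw);
}

void WaitForWork()
{
    EnterCriticalSection(&g_cs);
    while (!g_workReady)                     // woken only when there is work; no polling
        SleepConditionVariableCS(&g_cv, &g_cs, INFINITE);
    g_workReady = false;
    LeaveCriticalSection(&g_cs);
}

void SignalWork()
{
    EnterCriticalSection(&g_cs);
    g_workReady = true;
    LeaveCriticalSection(&g_cs);
    WakeConditionVariable(&g_cv);
}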

In summary - tuning scenarios that have lock contention can be challenging (but interesting!) work. Focus on measuring your application's performance and understanding where your hot paths are. The xperf tools in the Windows Performance Toolkit are your friend here :) We just released version 4.5 in the Microsoft Windows SDK for Windows 7 and .NET Framework 3.5 SP1, available as an ISO or via the web installer, and there is a dedicated forum for the xperf tools. V4.5 fully supports Win7, Vista, Windows Server 2008 - all versions.

Hawes answered 12/5, 2009 at 15:53 Comment(0)

Critical sections are fast, but InterlockedIncrement/InterlockedDecrement are faster still. See the LightweightLock implementation for a full usage sample.

Burgoo answered 12/5, 2009 at 16:38 Comment(2)
LightWeightLock may be slower since it is a spinlock. It can also deadlock in the case of priority inversion. (Chromic)
I tend to use the Interlocked functions (more specifically InterlockedCompareExchange) to guard small blocks of code which don't call other functions. In my case I noticed that the Interlocked functions were twice as fast as the CriticalSection functions (which was very important in my case, since it was a small block that was executed very often). A disadvantage of the Interlocked functions is that you cannot nest them and you can accidentally misuse them (e.g. lock in one thread, unlock in another thread). (Lift)

The critical section will spin for a short while (a few ms), repeatedly checking whether the lock is free. After the spin count 'times out', it falls back to the kernel event. So in the case where the holder of the lock releases it quickly, you never have to make the expensive transition into kernel code.

EDIT: Went and found some comments in my code: apparently the MS Heap Manager uses a spin count of 4000 (integer increments, not ms).

Wolframite answered 12/5, 2009 at 15:28 Comment(4)
Isn't 4 seconds a bit too much? Also, the event object gets created the moment the OS finds the critical section is locked; that is a bit of overhead, right? (Ontiveros)
wow, 4 seconds is huge in computer terms... did you perhaps mean 4000 microseconds (i.e. 4ms)? I don't think a context switch takes even 4ms on a modern machine. Can you provide a citation of the 4sec figure? (Goshorn)
Actually, the spin count value is just a count - it is not a value expressed in units of time. The 4,000 value comes from the MSDN documentation for InitializeCriticalSectionAndSpinCount(). Note that this may change from release to release; the docs say 'roughly 4,000'. (Hawes)
4000 cycles in a loop is just a handful of microseconds. (Martingale)

Here's a way to look at it:

If there's no contention, then the spin lock is really fast compared to going to kernel mode for a Mutex.

When there is contention, a CriticalSection is slightly more expensive than using a Mutex directly (because of the extra work to detect the spinlock state).

So it boils down to a weighted average, where the weights depend on the specifics of your calling pattern. That being said, if you have little contention, then a CriticalSection is a big win. If, on the other hand, you consistently have lots of contention, then you'd be paying a very small penalty over using a Mutex directly. But in that case, what you'd gain by switching to a Mutex is small, so you'd probably be better off trying to reduce the contention.
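
As a rough illustration with invented numbers: if the uncontended fast path costs about 25 ns and a contended kernel wait costs about 1000 ns, then at 5% contention a critical section averages roughly 0.95 × 25 + 0.05 × 1000 ≈ 74 ns per acquisition, while a plain mutex pays the full kernel round trip on every acquisition, contended or not.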

Anthropophagite answered 13/5, 2009 at 22:18 Comment(2)
Set the SpinCount to 0 and it is at least never slower than a mutex, even under heavy contention. (Martingale)
Haha, and it is the same as a mutex! (Leitmotiv)

A critical section is faster than a mutex because a critical section is not a kernel object; it is part of the global memory of the current process. A mutex actually resides in the kernel, and operating on a mutex object requires a kernel transition, whereas operating on a critical section does not. Even though critical sections are fast, there will still be a kernel transition when a thread using a critical section has to go into a wait state, because thread scheduling happens on the kernel side.
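
A rough way to check this yourself is to time uncontended lock/unlock pairs for both primitives (a sketch only; a serious measurement would pin threads, warm up the caches, and average many runs):

#include <windows.h>
#include <stdio.h>

int main()
{
    const int N = 1000000;
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);
    HANDLE mutex = CreateMutex(NULL, FALSE, NULL);

    LARGE_INTEGER freq, t0, t1, t2;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < N; ++i) {            // user-mode interlocked fast path
        EnterCriticalSection(&cs);
        LeaveCriticalSection(&cs);
    }
    QueryPerformanceCounter(&t1);
    for (int i = 0; i < N; ++i) {            // kernel transition on every call
        WaitForSingleObject(mutex, INFINITE);
        ReleaseMutex(mutex);
    }
    QueryPerformanceCounter(&t2);

    printf("critical section: %.0f ns/pair\n",
           (double)(t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / N);
    printf("mutex:            %.0f ns/pair\n",
           (double)(t2.QuadPart - t1.QuadPart) * 1e9 / freq.QuadPart / N);

    CloseHandle(mutex);
    DeleteCriticalSection(&cs);
    return 0;
}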

Pindaric answered 6/1, 2015 at 9:20 Comment(0)

From my experience and experiments, CRITICAL_SECTION is extremely slow compared to a pthreads implementation.

"Extremely" here means around 10 times slower at switching between threads when the number of locks/unlocks is large, comparing the same code against the pthread implementation.

I thus never use critical sections anymore; pthreads are also available on MS Windows, and finally the performance nightmares are over.
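
For reference, the calls such a port exposes look like this (the answer does not name a specific port, so take this as a generic pthread_mutex sketch, e.g. against pthreads-win32):

#include <pthread.h>

pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

void CriticalWork(void)
{
    pthread_mutex_lock(&g_mutex);
    /* ... touch shared state ... */
    pthread_mutex_unlock(&g_mutex);
}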

Chifley answered 23/2, 2021 at 10:17 Comment(2)
What is the name of "the pthreads implementation", i.e. the exact name of the function that you're talking about? (Lowman)
I am referring to the standard UN*X pthreads library that has been ported to MS Windows. (Chifley)
