How expensive is the lock statement?

I've been experimenting with multithreading and parallel processing, and I needed a counter to do some basic counting and statistical analysis of the processing speed. To avoid problems with concurrent use of my class I've used a lock statement on a private variable in my class:

private object mutex = new object();

public void Count(int amount)
{
 lock(mutex)
 {
  done += amount;
 }
}

But I was wondering... how expensive is locking a variable? What are the negative effects on performance?

Suggestive answered 12/1, 2011 at 20:20 Comment(5)
Locking the variable isn't that expensive; it's the waiting on a locked variable that you want to avoid.Perjured
it's a lot less expensive than spending hours on tracking down another race condition ;-)Shoon
Well... if a lock is expensive you might want to avoid it by restructuring the program so that it needs fewer locks. Or I could implement some other kind of synchronization.Suggestive
I had a dramatic improvement in performance (just now, after reading @Perjured's comment) simply by moving a lot of code out of my lock blocks. Bottom line: from now on I'll leave only the variable access (usually one line) inside a lock block, a sort of "just-in-time locking". Does it make sense?Washington
@Washington Of course it makes sense; it should also be an architectural principle: keep locked regions as short, simple, and fast as possible, synchronizing only the data that really needs it (see the sketch after these comments). On server boxes you should also take the hybrid nature of the lock into consideration: contention, even if it is not critical for your own code, makes cores spin on each access while the lock is held by someone else, so you are effectively devouring CPU resources from other services on the server for some time before your thread gets suspended.Statism
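A minimal sketch of that keep-the-locked-region-tiny idea, with a made-up Counter class and a ComputeAmount placeholder: do the expensive per-item work outside the lock and protect only the shared update.

class Counter
{
    private readonly object mutex = new object();
    private long done;

    // Stand-in for whatever expensive per-item work precedes the update.
    private static int ComputeAmount(int item) => item * 2;

    public void Process(int item)
    {
        // The expensive work happens outside the lock, so other threads are never blocked by it.
        int amount = ComputeAmount(item);

        // Only the shared update is protected; the critical section is a single statement.
        lock (mutex)
        {
            done += amount;
        }
    }

    public long Done
    {
        get { lock (mutex) { return done; } }
    }
}
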
103

Here is an article that goes into the cost. Short answer is 50ns.

Rossierossing answered 12/1, 2011 at 20:22 Comment(10)
So in conclusion the more objects you have the more expensive it gets.Suggestive
Short better answer: 50ns + time spent waiting if other thread is holding lock.Stylistic
The more threads are entering and leaving lock, the more expensive it gets. The cost expands exponentially with the number of threadsIronwork
Some context: dividing two numbers on a 3Ghz x86 takes about 10ns (not including the time it takes to fetch/decode the instruction); and loading a single variable from (non-cached) memory into a register takes about 40ns. So 50ns is insanely, blindingly fast - you shouldn't worry about the cost of using lock any more than you'd worry about the cost of using a variable.Sternmost
Also, that article was old when this question was asked.Jacobah
Results from the article do not apply to a server environment. On multi-socket servers the cost will be even higher, as each CPU deals directly only with its own memory. So, readers, keep this in mind.Tenerife
Running the test in single-threaded mode on a Core i3: with locks = 3215 ms; without locks = 479 ms; difference = 2736 ms; 100,000,000 locks take 2735 ms, so one lock takes about 0.02735 microseconds.Clupeoid
Running the test in multithreaded mode on the same machine: with locks = 6474 ms; without locks = 476 ms; difference = 5998 ms; 100,000,000 locks take 5997 ms, so one lock takes about 0.05997 microseconds.Clupeoid
It seems there is almost no cost on modern machines.Clupeoid
"Almost no cost" is a really great metric, not to mention incorrect. You do not take into consideration that it is short and fast only and ONLY if there is no contention at all, i.e. a single thread, and IN SUCH A CASE you DO NOT NEED THE LOCK AT ALL. The second issue: lock is not a plain lock but a hybrid lock; inside the CLR it detects, using atomic operations, that the lock is not held by anyone, and in that case it avoids the call into the operating system kernel, a ring transition these tests never measure (see the sketch below). What is measured as 25 to 50 ns is really just the application-level interlocked code taken when the lock is free.Statism
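To illustrate the hybrid-lock point, here is a toy fast-path/slow-path lock; it is only a sketch of the idea, not the CLR's actual Monitor implementation. When uncontended it is taken with a single interlocked instruction; under contention it spins briefly in user mode, and only then falls back to a kernel wait.

using System.Threading;

// Toy hybrid lock: interlocked fast path, short user-mode spin, then a kernel wait.
class ToyHybridLock
{
    private int taken;                                   // 0 = free, 1 = held
    private readonly AutoResetEvent waiter = new AutoResetEvent(false);

    public void Enter()
    {
        // Fast path: one atomic instruction when nobody else holds the lock.
        if (Interlocked.CompareExchange(ref taken, 1, 0) == 0)
            return;

        // Contended path: spin for a while (this is the CPU time other code loses)...
        var spinner = new SpinWait();
        while (Interlocked.CompareExchange(ref taken, 1, 0) != 0)
        {
            spinner.SpinOnce();
            // ...then block in the kernel once spinning stops being worthwhile.
            if (spinner.NextSpinWillYield)
                waiter.WaitOne(1);
        }
    }

    public void Exit()
    {
        Volatile.Write(ref taken, 0);
        waiter.Set();                                    // wake one waiter, if any
    }
}
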
54

The technical answer is that this is impossible to quantify; it depends heavily on the state of the CPU's write-back buffers and on how much of the data the prefetcher has gathered that must be discarded and re-read, both of which are very non-deterministic. I use 150 CPU cycles as a back-of-the-envelope approximation that avoids major disappointments.

The practical answer is that it is waaaay cheaper than the amount of time you'll burn on debugging your code when you think you can skip a lock.

To get a hard number you'll have to measure. Visual Studio has a slick concurrency analyzer available as an extension.
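A rough way to get such a number yourself, under contention rather than on a single thread (the LockCostBenchmark class, thread count, and iteration count here are arbitrary choices, and results vary a lot by machine):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class LockCostBenchmark
{
    static void Main()
    {
        const int Threads = 4;
        const int IterationsPerThread = 10_000_000;
        var gate = new object();
        long counter = 0;

        var sw = Stopwatch.StartNew();
        var tasks = new Task[Threads];
        for (int t = 0; t < Threads; t++)
        {
            // Every thread hammers the same lock, so this measures the contended cost.
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < IterationsPerThread; i++)
                    lock (gate)
                        counter++;
            });
        }
        Task.WaitAll(tasks);
        sw.Stop();

        double nsPerAcquire = sw.Elapsed.TotalMilliseconds * 1_000_000.0
                              / ((double)Threads * IterationsPerThread);
        Console.WriteLine($"{Threads} threads: ~{nsPerAcquire:F1} ns per contended lock acquire");
    }
}
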

Griswold answered 12/1, 2011 at 20:34 Comment(2)
Actually no, it can be quantified and measured. It is just not as easy as sprinkling those locks all around the code and then stating that it all costs just 50 ns, a myth measured with single-threaded access to the lock.Statism
"think you can skip a lock"... I think that's where a lot of people are at when they read this question...Adkison
35

Further reading:

I would like to present a few articles of mine that look at synchronization primitives in general and dig into Monitor and the C# lock statement: their behavior, properties, and costs under different scenarios and thread counts. They focus specifically on CPU wastage and on throughput, to show how much work can be pushed through in each scenario:

https://www.codeproject.com/Articles/1236238/Unified-Concurrency-I-Introduction
https://www.codeproject.com/Articles/1237518/Unified-Concurrency-II-benchmarking-methodologies
https://www.codeproject.com/Articles/1242156/Unified-Concurrency-III-cross-benchmarking

Original answer:

Oh dear!

It seems that the answer flagged here as THE ANSWER is inherently incorrect! I would respectfully ask its author to read the linked article to the end.

The author of that 2003 article measured on a dual-core machine only, and in the first case he measured locking with a single thread; the result was about 50 ns per lock access.

That says nothing about a lock in a concurrent environment. So we have to keep reading, and in the second half of the article the author measures the locking scenario with two and three threads, which gets closer to the concurrency levels of today's processors.

With two threads on that dual core, the lock costs 120 ns, and with three threads it goes up to 180 ns. So the cost clearly depends on the number of threads accessing the lock concurrently.

So it is simple: it is not 50 ns unless there is only a single thread, in which case the lock is useless.

Another issue for consideration is that the cost is reported as an average time!

If individual iterations were timed, some would take anywhere from 1 ms to 20 ms: the majority are fast, but a few threads end up waiting for processor time and incur delays of whole milliseconds.

This is bad news for any kind of application that requires high throughput and low latency.

And the last issue for consideration is that there can be slower operations inside the lock, and very often that is the case. The longer the block of code executed inside the lock, the higher the contention, and delays rise sky high.

Please consider that more than a decade has passed since 2003; that is several generations of processors designed specifically to run fully concurrently, and locking considerably harms their performance.

Statism answered 27/9, 2015 at 6:21 Comment(2)
To clarify, the article is not saying the lock performance degrades with the number of threads in the application; performance degrades with the number of threads contending over the lock. (That is implied, but not clearly stated, in the answer above.)Journalistic
I presume you mean this: "So it seems to be clearly dependent on the number of concurrently accessed threads and more is worse." Yes, the wording could be better. I meant "concurrently accessed" as threads concurrently accessing the lock, thus creating contention.Statism
23

This doesn't answer your query about performance, but I can say that the .NET Framework does offer an Interlocked.Add method that will allow you to add your amount to your done member without manually locking on another object.
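Applied to the counter from the question, that would look roughly like this (assuming done is an int field):

using System.Threading;

class Counter
{
    private int done;

    public void Count(int amount)
    {
        // Atomically adds amount to done; no separate lock object is needed.
        Interlocked.Add(ref done, amount);
    }

    // Volatile.Read guarantees a fresh value when other threads are still counting.
    public int Done => Volatile.Read(ref done);
}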

Uncontrollable answered 12/1, 2011 at 20:23 Comment(4)
Yes, this is probably the best answer, but mainly because the code is shorter and cleaner. The difference in speed is not likely to be noticeable.Vitriolic
Thanks for this answer. I'm doing more stuff with locks; adding ints is just one of many things. Love the suggestion, will use it from now on.Suggestive
locks are much, much easier to get right, even if lock-free code is potentially faster. Interlocked.Add on its own has the same issues as += with no synchronization.Selmaselman
Lock-free isn't "potentially faster". It can be orders of magnitudes faster in extremely tight, long-running, concurrent loops.Viaduct
10

lock (Monitor.Enter/Exit) is very cheap, cheaper than alternatives like a WaitHandle or Mutex.

But what if it were (a little) slow? Would you rather have a fast program with incorrect results?

Vitriolic answered 12/1, 2011 at 20:25 Comment(3)
Haha... I was going for the fast program and the good results.Suggestive
@henk-holterman There are multiple issues with your statements. First, as this question and its answers clearly show, the impact of locks on overall performance is poorly understood; people even repeat the 50 ns myth, which applies only in a single-threaded environment. Second, your statement is here and will stay for years, and in the meantime processors have grown in core count while the speed of individual cores has not grown nearly as much. Third, applications only become more complex over time, and then it is layer upon layer of locking in an environment of many cores, with the count still rising: 2, 4, 8, 10, 16, 20, 32...Statism
My usual approach is to build synchronization in a loosely coupled way, with as little interaction between threads as possible; that leads very quickly to lock-free data structures. I wrote wrappers around a spinlock to simplify development, and even though the TPL has its own concurrent collections, I have developed spin-locked collections of my own around list, array, dictionary, and queue, as I needed a little more control and sometimes some code running under the spinlock (a spin-locked counter is sketched below). I can tell you it is possible, it solves multiple scenarios the TPL collections cannot, and it brings a great gain in performance and throughput.Statism
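For reference, a spin-locked counter using the framework's System.Threading.SpinLock (rather than the custom wrappers mentioned above) looks roughly like this; it only pays off when the protected section is extremely short:

using System.Threading;

class SpinLockedCounter
{
    // SpinLock is a struct: keep it in a field and never copy it.
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private long done;

    public void Count(int amount)
    {
        bool lockTaken = false;
        try
        {
            spinLock.Enter(ref lockTaken);
            done += amount;                   // keep the protected section tiny
        }
        finally
        {
            if (lockTaken)
                spinLock.Exit();
        }
    }
}
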
7

The cost of a lock in a tight loop, compared to an alternative with no lock, is huge. You can afford to loop many times and still be more efficient than a lock. That is why lock-free queues are so efficient.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace LockPerformanceConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            var stopwatch = new Stopwatch();
            const int LoopCount = (int) (100 * 1e6);
            int counter = 0;

            for (int repetition = 0; repetition < 5; repetition++)
            {
                stopwatch.Reset();
                stopwatch.Start();
                // Locked version: the stopwatch instance doubles as the lock object here.
                for (int i = 0; i < LoopCount; i++)
                    lock (stopwatch)
                        counter = i;
                stopwatch.Stop();
                Console.WriteLine("With lock: {0}", stopwatch.ElapsedMilliseconds);

                stopwatch.Reset();
                stopwatch.Start();
                // Unlocked baseline: the same assignment with no lock.
                for (int i = 0; i < LoopCount; i++)
                    counter = i;
                stopwatch.Stop();
                Console.WriteLine("Without lock: {0}", stopwatch.ElapsedMilliseconds);
            }

            Console.ReadKey();
        }
    }
}

Output:

With lock: 2013
Without lock: 211
With lock: 2002
Without lock: 210
With lock: 1989
Without lock: 210
With lock: 1987
Without lock: 207
With lock: 1988
Without lock: 208
Veinstone answered 29/9, 2013 at 19:16 Comment(1)
This might be a bad example because your loop really doesn't do anything apart from a single variable assignment, and a lock is at least two function calls. Also, the 20 ns per lock you are getting isn't that bad.Josefinejoseito
6

There are a few different ways to define "cost". There is the actual overhead of obtaining and releasing the lock; as Jake writes, that's negligible unless this operation is performed millions of times.

Of more relevance is the effect this has on the flow of execution. This code can only be entered by one thread at a time, so if you have 5 threads performing this operation on a regular basis, 4 of them will end up waiting for the lock to be released and then waiting to be the first thread scheduled into that piece of code after the release. So your algorithm is going to suffer significantly; how much depends on the algorithm and on how often the operation is called. You can't really avoid it without introducing race conditions, but you can ameliorate it by minimizing the number of calls to the locked code.
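One common way to minimize those calls, sketched here with made-up names (BatchedCounter, Flush), is to accumulate into a local value and update the shared, locked counter only once per batch:

class BatchedCounter
{
    private readonly object mutex = new object();
    private long done;

    // Instead of locking once per item, each worker locks once per 1000 items.
    public void ProcessItems(int[] amounts)
    {
        const int BatchSize = 1000;
        long localSum = 0;
        for (int i = 0; i < amounts.Length; i++)
        {
            localSum += amounts[i];
            if ((i + 1) % BatchSize == 0)
                Flush(ref localSum);
        }
        Flush(ref localSum);                  // publish whatever is left over
    }

    private void Flush(ref long localSum)
    {
        if (localSum == 0) return;
        lock (mutex)
        {
            done += localSum;
        }
        localSum = 0;
    }
}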

Machinist answered 12/1, 2011 at 20:33 Comment(0)
