Lock-free multi-threading is for real threading experts

I was reading through an answer that Jon Skeet gave to a question and in it he mentioned this:

As far as I'm concerned, lock-free multi-threading is for real threading experts, of which I'm not one.

It's not the first time that I have heard this, but I find very few people talking about how you actually do it if you are interested in learning how to write lock-free multi-threading code.

So my question is: besides learning all you can about threading in general, where do you start trying to learn to write lock-free multi-threading code specifically, and what are some good resources?

Cheers

Inheritrix answered 27/3, 2010 at 10:50 Comment(3)
I use gcc, linux, and x86/x64 platforms. Lock-free is not nearly as hard as they all make it sound! The gcc atomic builtins have memory barriers on Intel, but that doesn't matter in real life. What matters is that the memory is modified atomically. It just shakes out when you design "lock free" data structures that it doesn't matter when another thread sees a change. Singly linked lists, skip lists, hash tables, free lists, etc. are all pretty easy to do lock free. Lock free is not for everything. It's just another tool that is right for certain situations.Mcnabb
1024cores.netMartita
Voting to close as resource recommendation, or not clear what you're asking.Genuflection

Current "lock-free" implementations follow the same pattern most of the time:

  • read some state and make a copy of it*
  • modify copy*
  • do an interlocked operation
  • retry if it fails

(*optional: depends on the data structure/algorithm)

The last bit is eerily similar to a spinlock. In fact, it is a basic spinlock. :)
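
To make the pattern above concrete, here is a minimal C# sketch of such a CAS-retry loop - a hypothetical shared running total, with names invented for illustration rather than taken from any of the material linked below:

    using System.Threading;

    // Hypothetical example of the read / modify / interlocked-op / retry pattern.
    static class LockFreeTotal
    {
        private static long _total;

        public static void Add(long amount)
        {
            while (true)
            {
                long snapshot = Volatile.Read(ref _total);    // read some state
                long updated = snapshot + amount;             // modify the "copy"
                // interlocked operation: succeeds only if nobody changed _total in the meantime
                if (Interlocked.CompareExchange(ref _total, updated, snapshot) == snapshot)
                    return;
                // another thread won the race - retry with the fresh value
            }
        }
    }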
I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multi-threading is dominated by the cache and memory-coherency work they must carry out.

What you gain however with a data structure that is "lock-free" is that your "locks" are very fine grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).

The trick most of the time is that you do not have dedicated locks - instead you treat e.g. all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify and try to update if there was no update since your last read. If there was, you retry.
This makes your "locking" (oh, sorry, non-locking :) very fine grained, without introducing additional memory or resource requirements.
Making it more fine-grained decreases the probability of waits. Making it as fine-grained as possible without introducing additional resource requirements sounds great, doesn't it?

Most of the fun however can come from ensuring correct load/store ordering.
Contrary to one's intuition, CPUs are free to reorder memory reads/writes - they are very smart, by the way: you will have a hard time observing this from a single thread. You will, however, run into issues when you start to do multi-threading on multiple cores. Your intuitions will break down: just because an instruction is earlier in your code, it does not mean that it will actually happen earlier. CPUs can process instructions out of order, and they especially like to do this to instructions with memory accesses, to hide main memory latency and make better use of their cache.

It certainly goes against intuition that a sequence of code does not flow "top-down" but instead runs as if there were no sequence at all - a state of affairs that may well be called the "devil's playground". I believe it is infeasible to give an exact answer as to which load/store re-orderings will take place. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst. "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here, on this spot."

Matters are complicated by the fact that even these mays and mights differ across CPU architectures. It might be the case, for example, that something that is guaranteed not to happen on one architecture happens on another.
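
A small, hypothetical C# sketch of the classic store/load reordering scenario the paragraphs above allude to (class and field names are invented): without the Thread.MemoryBarrier() calls, both threads may read 0, because each store can still sit in a store buffer when the other core performs its load; with the full fences, at least one of them must observe the other's write.

    using System;
    using System.Threading;

    class ReorderingDemo
    {
        static int x, y, r1, r2;

        static void Main()
        {
            var t1 = new Thread(() => { x = 1; Thread.MemoryBarrier(); r1 = y; });
            var t2 = new Thread(() => { y = 1; Thread.MemoryBarrier(); r2 = x; });
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
            // With the fences in place, at least one of r1/r2 must be 1.
            Console.WriteLine($"r1={r1}, r2={r2}");
        }
    }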


To get "lock-free" multi-threading right, you have to understand memory models.
Getting the memory model and its guarantees right is not trivial, however, as demonstrated by this story, whereby Intel and AMD made some corrections to the documentation of MFENCE, causing a stir among JVM developers. As it turned out, the documentation that developers had relied on from the beginning was not so precise in the first place.

Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles and memory barriers. :) (Be sure to follow the links on that page.)

As an added bonus, you will get introduced to the .NET memory model on a side quest. :)
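
One pattern in exactly that territory is double-checked locking for lazy initialization. A minimal sketch (with invented names), relying on the .NET behavior that lock entry/exit acts as a memory barrier and that volatile accesses are ordered:

    public sealed class LazyWidget
    {
        private static readonly object _gate = new object();
        private static volatile LazyWidget _instance;   // volatile: publish only a fully constructed object

        public static LazyWidget Instance
        {
            get
            {
                if (_instance == null)                   // first check, no lock taken
                {
                    lock (_gate)                         // lock implies the needed barriers
                    {
                        if (_instance == null)           // second check inside the lock
                            _instance = new LazyWidget();
                    }
                }
                return _instance;
            }
        }

        private LazyWidget() { }
    }

(.NET's Lazy&lt;T&gt; packages the same idea, but the hand-written form shows where the barriers come from.)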

There is also an "oldie but goldie" from Vance Morrison: What Every Dev Must Know About Multithreaded Apps.

...and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.

A good STM can get as close to fine-grained locking as it gets and will probably provide a performance that is close to or on par with a hand-made implementation. One of them is STM.NET from the DevLabs projects of MS.

If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166.
Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and seems to scale well to 750 CPUs.

If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.

@Ben made many comments about MPI: I sincerely agree that MPI may shine in some areas. An MPI based solution can be easier to reason about, easier to implement and less error-prone than a half-baked locking implementation that tries to be smart. (It is however - subjectively - also true for an STM based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in e.g. Erlang, as many successful examples suggest.

MPI, however has its own costs and its own troubles when it is being run on a single, multi-core system. E.g. in Erlang, there are issues to be solved around the synchronization of process scheduling and message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This, for example, means that there is an inevitable context switch between lightweight processes. It is true that it is not a "classic context switch" but mostly a user-space operation, and it can be made fast - however, I sincerely doubt that it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library. N:M scheduling with lightweight processes is not new. LWPs were there in Solaris for a long time. They were abandoned. There were fibers in NT. They are mostly a relic now. There were "activations" in NetBSD. They were abandoned. Linux had its own take on the subject of N:M threading. It seems to be somewhat dead by now.
From time to time, there are new contenders: for example McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems by exploiting the new UMS.

I guess the OP's question is not about the merits of, or subjective arguments for/against, any particular solution, but if I had to answer that, I guess it depends on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/"lock-free" techniques or an STM will yield the best results in terms of performance, and would probably beat an MPI solution any time, even if the above wrinkles are ironed out e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I would perhaps choose classic coarse-grained locking or if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably make a natural choice.
Note that there are MPI implementations for .NET as well (though they seem to be not as active).

Patency answered 27/3, 2010 at 10:51 Comment(28)
That's a typical way of converting code that relies on locks to be "lock-free". It's not at all typical of code designed from the ground up to avoid locks, which frequently use some form of producer/consumer queues for message passing.Ancona
@Ben: That's an illusion, if you look deep - these systems use their own internal structures for message passing and resource scheduling. These structures use locking or the above typical "lock-free" techniques. As a good example, Erlang is built from the ground up with MPI in mind. It still uses locking at the lowest level. It turns out that it used one big fat lock for a global process queue. This created scalability problems, however. Now they plan to use multiple locks instead of one: #605683Patency
@Ben: ...so in the end, someone, somewhere, sometime will have to write a low-lock/"lock-free" queue implementation for Erlang - running into the same problems as above...;)Patency
@andras: But only queues which have more than one writer require a lock (whether software lock, or hardware bus lock during an interlocked operation as you mentioned). Yes, there's some shared state written by multiple producers in any interesting system, but that can be made wait free (producer makes the receiver runnable, doesn't matter what the prior state was, doesn't matter whether someone else makes the receiver runnable as well). But more importantly, shared state is minimized in a system designed for message passing, so you don't pay for the cache sync very often at all.Ancona
I guess what I'm saying is that some of the steps are the same. But it's possible to get rid of the (make a copy) and the (retry) steps with a lock-free wait-free message queue and those are the bulk of the costs of parallelism.Ancona
@Ben: I think I see what you mean. You can in fact do something very similar with C++ on Windows 7, have a look at ConCRT: msdn.microsoft.com/en-us/library/dd998048(VS.100).aspx. It uses cooperative scheduling of lightweight tasks. It closely resembles MPI at the lowest levels. N:M scheduling with lightweight tasks was done before, it is not new. It has its own set of problems, however. en.wikipedia.org/wiki/….Patency
@andras: I didn't claim I had production-grade code for that. Note that feasibility and availability of production-ready code are very different things. Since my work is on medical devices, I'm much more concerned with correctness such as the deadlock-free characteristics of lock-free code than the performance, but I know some of the theory as well. A multiple-producer single-consumer queue can be built from a set of parallel single-producer queues and a shared event (to set an event from multiple threads requires no retry). If connectivity is sparse, naive polling may perform acceptably.Ancona
(continued) If many producers must feed a single consumer, polling can be accomplished in log or log-log time with the use of flags for entire groups of queues. This doesn't scale very well in memory usage, but there's always some form of tradeoff. Since many systems can be designed with sparse connectivity, I understand there's a lot of applicability to this sort of queue implementation even if it isn't a universal solution.Ancona
I guess I'm not explaining myself properly. One single scheduler-integrated event wakes the consumer. The consumer then has to discover which of the incoming queues contains data. For sparse connectivity, an exhaustive search (in linear time) is ok. That's what I called "polling". But the polling only happens when at least one queue has data, so it's not continually burning CPU. For larger numbers of incoming queues, flags indicate the potential presence of data in a group of queues, allowing the search to take place in log time. And groups of groups give even better scalability.Ancona
@Ben: Be warned. It's only a braindump. There can be severe misunderstandings from my part about your ideas. :) When and how the wake-up event is reset? What kind of event will it be? E.g. the consumer wakes up, runs through the queues, finds all items that are queued and consumes them. The problem I see is that it cannot stop, since the event stays signaled - producers only set it - and the consumer does not see all the queues "at once" to know when to reset it and assert that: "now I have dequeued all items". New items could have arrived meanwhile and their signal would be lost.Patency
@Ben: using a semaphore with a count would alleviate this, I think. But then again: the whole "wait-free"-ness is not proper in the first place imho. Taking your sentence: "to set an event from multiple threads requires no retry". True. It might however effectively serialize access to a shared memory location - so instead of a linear speedup and nice parallelization, your performance will be closer to the uniprocessor figure. (Meanwhile I have managed to find some queue implementation: thedelphigeek.com/2008/07/lock-free-queue-finally.html. So please, take my apologies on that. :)Patency
@andras: I'm assuming an event which is reset atomically with wake. Maybe that isn't compatible with the wait-free requirement, but I think it can be done using InterlockedExchange (not InterlockedCompareExchange, which could fail and need retry).Ancona
@Ben: Sadly, no matter how one twists it, in reality even InterLockedExchange will not be compatible with a strict wait-free-ness guarantee. It will not scale if multiple threads from multiple cores write to the same memory location. It waits. If you think about it, it is natural. After all, you really do want to serialize access. That is the most fundamental reason why e.g. any rw lock that uses a single memory location will not scale for fine-grained locking even if you have 0 writers, no matter what cunning lock xor, lock or, lock and, etc. operations they employ.Patency
@Ben: e.g. bluebytesoftware.com/blog/2009/02/12/…Patency
@Ben: in fact, it is written in the assembly code: "lock". :-) Now how the heck is someone supposed to build a lock-free, wait-free data structure out of that? :) (To preempt debates: xchg == lock xchg. It implies a lock. Doesn't matter if you write it down, or not.)Patency
Lock free and wait free don't mean what you seem to think that they do. Lock free means the system can't completely stop progressing, while wait free means that no individual thread will stop progressing. Lock is used in a different sense than synchronization.Burweed
@Joel: You are right in that they are not the same. I guess we've written down "wait-free" enough times with Ben. :P Please notice the ":)" at the end. I've found it rather funny/ironic how lock-free algorithms/structures on x86/x64 are peppered with an instruction that says: "lock".Patency
@AndrasVass: Algorithms using Interlocked operations generally have the advantage that threads cannot be waylaid while they hold their "lock". Threads may have to wait indefinitely in the presence of resource contention from active threads, but will never have to wait for waylaid ones. Also, many CompareExchange algorithms guarantee that, no matter how much contention exists, someone will make forward progress (not all spinlock algorithms have that guarantee; some can livelock, with threads invalidating each others' progress even when they make none of their own).Peculiar
@supercat: you are right. There is an important distinction between algorithms - or platform specific methods - that can guarantee progress and those that cannot. Certainly, x86/64 interlocked operations fall into the former - "guarantee" - group - even though they are not wait-free.Patency
While this answer has a lot of good information, the headline idea that lock-free algorithms and data structures are essentially just a collection of very fine grained spinlocks is wrong. While you'll usually see retry loops in lock-free structures, the behavior is very different: locks (including spinlocks) exclusively acquire some resource and other threads cannot make progress while it is held. The "retry" in that sense is simply waiting for the exclusive resource to be released.Sphacelus
Lock-free algorithms, on the other hand, don't use CAS or other atomic instructions to acquire an exclusive resource, but rather to complete some operation. If they fail, it is due to a temporally fine-grained race with another thread, and in that case the other thread made progress (completed its operation). If a thread is indefinitely suspended, all other threads can still make progress. This is both qualitatively and performance-wise very different from exclusive locks. The number of "retries" is usually very low for most CAS-loops even under heavy contention...Sphacelus
... but that of course doesn't imply good scaling: contention for a single memory location is always going to be fairly slow on SMP machines, just due to inter-core inter-socket latencies, even if the number of CAS failures is low.Sphacelus
@BeeOnRope: At the time of writing it felt more appropriate to introduce some ideas by using mostly known concepts, like, quote and quote, "spinlock". It is true that lock-free algorithms and data structures have distinctive advantages in many areas (user mode code where a thread can be suspended at will, kernel mode code where locking is not appropriate, low-latency solutions, read-heavy data structures, etc.), the question was not about the undoubted merits of lock-freedom and why it is better or worse, but rather, about its challenges.Patency
@BeeOnRope: Also, please, feel free to edit. I can also make it a Community Wiki post if that serves the community better (e.g. it turns out that people get here from google searches on generic material for lock-free programming, which this post is not, despite some of the hopefully still useful links and material. Also, I am sure that some ideas in the post might have been relevant at a time, but "did not age well"..:)Patency
@AndrasVass - I guess it depends also on "good" vs "bad" lock-free code. Certainly anyone can write a structure and call it lock-free while it really just uses a user-mode spinlock and doesn't even meet the definition. I would also encourage any interested readers to check out this paper from Herlihy and Shavit which looks in a formal way at the various categories of lock-based and lock-free algorithms. Anything by Herlihy on this topic is also recommended reading.Sphacelus
@BeeOnRope: Shavit et al. also very successfully use spinning and novel back-off approaches in their lock-free publications. I guess it would be hard to find non-trivial lock-free solutions that lack some spinning, at least for handling inter-thread mutable state. The fact that this is almost a given in lock-free code, however, does not imply that this is all it takes to implement a lock-free solution. It also does not make the converse true. Putting a spinlock in a piece of code does not magically make it all lock-free. However, this was never stated at all.Patency
@AndrasVass - I disagree. Most of the classic lock-free structures (lists, queues, concurrent maps, etc) had no spinning even for shared mutable structures, and practical existing implementations of the same in, for example, Java follow the same pattern (I'm not as familiar with what's available in native-compiled C or C++ and it's harder there due to no garbage collection). Perhaps you and I have a different definition of spinning: I don't consider the "CAS-retry" you find in lock-free stuff "spinning". IMO "spinning" implies hot-waiting.Sphacelus
The essence of a lock free data structure is that it does not simply "spin". If there is some thread that is suspended in the middle of an operation then other threads must not wait on that thread to complete before they move forward. Typically this is implemented by allowing a thread to complete another thread's operation for it. Williams' C++ concurrency in action does a good job of explaining lock-free data structures and their general design-process.Midge

Joe Duffy's book:

http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html

He also writes a blog on these topics.

The trick to getting low-lock programs right is to understand at a deep level precisely what the rules of the memory model are on your particular combination of hardware, operating system, and runtime environment.

I personally am not anywhere near smart enough to do correct low-lock programming beyond InterlockedIncrement, but if you are, great, go for it. Just make sure that you leave lots of documentation in the code so that people who are not as smart as you don't accidentally break one of your memory model invariants and introduce an impossible-to-find bug.
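
In that spirit, a hypothetical sketch of the kind of InterlockedIncrement-level code the answer has in mind, with the documentation it asks for (type and member names are invented):

    using System.Threading;

    // Thread-safe hit counter. Interlocked.Increment is an atomic read-modify-write
    // with a full fence, so no explicit lock or volatile field is required here.
    // NOTE: do not "simplify" this to "_count++" - that compiles to a non-atomic
    // read/add/write sequence and will lose updates under concurrency.
    public sealed class HitCounter
    {
        private long _count;

        public long Increment() => Interlocked.Increment(ref _count);

        public long Current => Interlocked.Read(ref _count);
    }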

Szabo answered 27/3, 2010 at 15:12 Comment(0)

There is no such thing as "lock-free threading" these days. It was an interesting playground for academia and the like back at the end of the last century, when computer hardware was slow and expensive. Dekker's algorithm was always my favorite; modern hardware has put it out to pasture. It doesn't work anymore.

Two developments have ended this: the growing disparity between the speed of RAM and the CPU. And the ability of chip manufacturers to put more than one CPU core on a chip.

The RAM speed problem required the chip designers to put a buffer on the CPU chip. The buffer stores code and data, quickly accessible by the CPU core, and can be read from and written to RAM at a much slower rate. This buffer is called the CPU cache; most CPUs have at least two of them. The 1st-level cache is small and fast, the 2nd is big and slower. As long as the CPU can read data and instructions from the 1st-level cache, it will run fast. A cache miss is really expensive: it puts the CPU to sleep for as many as 10 cycles if the data is not in the 1st cache, and as many as 200 cycles if it isn't in the 2nd cache and needs to be read from RAM.

Every CPU core has its own cache; they store their own "view" of RAM. When the CPU writes data, the write is made to the cache, which is then, slowly, flushed to RAM. Inevitably, each core will now have a different view of the RAM contents. In other words, one CPU doesn't know what another CPU has written until that RAM write cycle has completed and the CPU refreshes its own view.

That is dramatically incompatible with threading. You always really care about the state of another thread when you must read data that was written by that thread. To ensure this, you need to explicitly program a so-called memory barrier. It is a low-level CPU primitive that ensures that all CPU caches are in a consistent state and have an up-to-date view of RAM. All pending writes have to be flushed to RAM, and the caches then need to be refreshed.

This is available in .NET: the Thread.MemoryBarrier() method implements one. Given that this is 90% of the job that the lock statement does (and 95+% of the execution time), you are simply not ahead by avoiding the tools that .NET gives you and trying to implement your own.
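
A minimal, hypothetical sketch of what using Thread.MemoryBarrier() instead of a lock looks like when one thread publishes a value for another (class and field names are invented for illustration):

    using System.Threading;

    // Publishing data to another thread with explicit barriers instead of a lock.
    // The lock statement would give you equivalent ordering guarantees implicitly.
    class Publisher
    {
        private int _data;
        private bool _ready;

        public void Produce()
        {
            _data = 42;
            Thread.MemoryBarrier();   // ensure the data store is visible before the flag store
            _ready = true;
        }

        public bool TryConsume(out int value)
        {
            bool ready = _ready;
            Thread.MemoryBarrier();   // ensure the flag read happens before the data read
            value = ready ? _data : 0;
            return ready;
        }
    }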

Chare answered 27/3, 2010 at 15:1 Comment(24)
I'm confused by your answer. Cache synchronization and contention of resources are only some of the penalties of OS provided locking. What about the cost of rescheduling? Not sure what your bottom line point is hereTupelo
Not sure how rescheduling plays a role at all. The point of threading is to have a lock not block the thread 99.9% of the time. If it is a lot less then you don't get any value for the money. If it does block, there isn't anything you can do but wait. The lock statement already implements a spinwait.Chare
Right I get that, I guess my question is why does "cache synchronization is slow" === "lock free is no better than locking"? I am new to lock free and like the sound of it but you seem to think its not super useful.Tupelo
Maybe you could explain why you are considering "lock free threading". What do you hope to gain? If you think you can somehow make it more efficient then I failed to explain what the problem with the cpu cache is all about.Chare
What about the benefit to consuming code of eliminating the possibility of deadlocks and reducing opportunities for threading-related coding errors (once the difficult part of setting up such a framework is in place)? I'm less interested in the performance aspect (as long as it's not significantly worse than locking) than I am in the simplification of writing correct multi-threaded code. If done correctly it should boost the ability of the average coder to write correct multithreaded code, much like garbage collection reduces the complexities of memory management.Menadione
@Davy8: composition makes it still hard. If I have two lock-free hash-tables and as a consumer I access both of them, this will not guarantee consistency of state as a whole. The closest you can come today is STMs where you can put the two accesses e.g. in a single atomic block. All in all, consuming lock-free structures can be just as tricky in many cases.Patency
What definition are you using for lock free threading? Are Erlang's message passing or Software Transaction included, or not?Imaret
@Rob: I believe that there is no magic going on here: STMs still utilize low-lock ("lock-free") techniques, such as lock striping or CAS+retry. I do not know much about Erlang internals, but message queues in a shared memory face the same problems. You have shared state - now it is the message queue that you have to lock or implement it with a "lock-free" structure. It is only the granularity and scope of the "lock" that differs. However using MPI or an STM can make a difference in terms of performance/usability - compared especially to an average, half-decent locking solution.Patency
Message-passing systems generally scale far better than anything that uses locks. They have to be designed that way from the ground up, but they're generally easier to reason about (i.e. formal correctness proofs) and good use of partitioning usually makes the complexity of any single component lower as well.Ancona
@Ben: I sincerely agree - what I wanted to point out is that as far as I know, you have to have some shared state somewhere in the system. True, it can be minimal in case of an MPI system. As far as I know, Erlang is built from the ground up with MPI in mind. It still uses locking at the lowest level. It turns out that it used one big fat lock for a global process queue. This created scalability problems, however. Now they plan to use multiple locks instead of one: #605683Patency
I may be wrong, but I think you've mis-explained how cache coherency works. Most modern multicore processors have coherent caches, which means that the cache hardware handles making sure that all processes have the same view of RAM contents -- by blocking "read" calls until all corresponding "write" calls have completed. The Thread.MemoryBarrier() documentation (msdn.microsoft.com/en-us/library/…) says nothing about cache behavior at all -- it's simply a directive that prevents the processor from reordering reads and writes.Judicator
"There is no such thing as "lock-free threading" these days." Tell that to the Erlang and Haskell programmers.Burgwell
@Brooks: modern x86 CPUs haven't maintained coherent caches for quite some time. The latest offerings from both Intel and AMD are highly non-uniform and trying to maintain coherency causes a severe performance hit. Instead cache coherency protocols are used to re-establish coherency at specific points during execution, and minimizing these is important to performance. Also, the register file is equivalent to another level of cache and it isn't even visible to cache coherency protocols. Ensuring that computations get flushed from registers to (cache) memory in the right order is critical.Ancona
@Ben: Thanks for the correction. Do you have any suggestions for good references where I can read up on this a bit more?Judicator
@Brooks: The wikipedia article en.wikipedia.org/wiki/MOESI_protocol has a basic discussion along with a bunch of links to other references, including the AMD engineering documents. Note that the mere existence of the "invalid" state implies that the cache isn't coherent. A coherent cache system (write-through) would snoop the write from the other processor and update the local cache; it wouldn't ever become invalid.Ancona
@Ben: Thanks for the link. I think we must simply be talking about rather different meanings of the phrase "coherent cache" -- when the first line of the description of a caching protocol is, "This is a full cache coherency protocol...," I tend to think of it as coherent. :) Other than that difference of terminology, though, I was talking about MOESI-protocol mechanics, so we're on the same page there.Judicator
The CPU does not sleep on a cache miss. There is a table of in-flight instructions, typically a few hundred instructions. If a single instruction happens to be blocked on a memory read of some kind, all the others continue.Fennec
Hmm, no, no real cpu has a pipeline that's a few hundred instructions long.Chare
@HansPassant: "There is no such thing as 'lock-free threading' these days". F#, Erlang, Haskell, Cilk, OCaml, Microsoft's Task Parallel Library (TPL) and Intel's Threaded Building Blocks (TBB) all encourage lock-free multithreaded programming. I rarely use locks in production code these days.Encounter
@HansPassant: "a so-called memory barrier. It is a low-level CPU primitive that ensures that all CPU caches are in a consistent state and have an up to date view of RAM. All pending writes have to flushed to RAM, the caches then need to be refreshed". A memory barrier in this context prevents memory instructions (loads and stores) from being reordered by the compiler or CPU. Nothing to do with the consistency of CPU caches.Encounter
@HansPassant: "the Thread.MemoryBarrier() method implements one. Given that this is 90% of the job that the lock statement does (and 95+% of the execution time)". The job of a lock is to provide mutual exclusion which is completely different to memory barriers preventing the reordering of memory instructions. Performance issues with locks come from contention. In contrast, memory barriers cannot be a source of contention because they just prevent memory instruction reordering on a single core.Encounter
@BrooksMoses - you are correct. The idea that because "each CPU core has its own cache" they "store their own view of RAM" and are out of sync is false on substantially all modern general-purpose CPUs. All noteworthy CPUs are cache coherent, which means that each CPU has the same view of RAM at the cache level. This is achieved by protocols based roughly on MESI: a carefully organized state machine of probes, invalidations and other activity that ensures that every core has a consistent view of memory.Sphacelus
The variations from sequential consistency that you do see (and which can be prevented by barriers) are due to effects "closer" to the CPU than the caches, such as store buffers and other types of memory speculation (e.g., satisfying later loads after a cache miss). It might seem like splitting hairs, but it is important to understand that caching is not generally the cause of inconsistent views of memory across CPUs - and that's one reason why barriers take only a dozen cycles, rather than the tens of thousands that would be needed to "flush the caches" under this answer's interpretation.Sphacelus
@Fennec is correct: out-of-order exec can see a couple hundred instructions past a cache miss, tracked by the ROB. This is separate from the number of stages in the pipeline (length), see blog.stuffedcow.net/2013/05/measuring-rob-capacity / realworldtech.com/sandy-bridge/10. Hiding cache-miss latency instead of stalling is one of the primary reasons for out-of-order exec. CPUs can even do later loads, with hit-under-miss and miss-under-miss creating memory-level parallelism by having the CPU tracking up to 12 (Skylake) incoming / outgoing cache lines in parallel.Benadryl

Google for lock free data structures and software transactional memory.

I'll agree with Jon Skeet on this one; lock-free threading is the devil's playground, and best left to people who know that they know what they need to know.

Maureenmaureene answered 27/3, 2010 at 10:54 Comment(0)

When it comes to multi-threading you have to know exactly what you are doing. I mean, explore all the possible scenarios/cases that might occur when you are working in a multi-threaded environment. Lock-free multithreading is not a library or a class that we incorporate; it's knowledge and experience that we earn during our journey with threads.

Doubleedged answered 27/3, 2010 at 10:54 Comment(2)
There are numerous libraries that provide lock-free threading semantics. STM is of particular interest, of which there are quite a number of implementations around.Maureenmaureene
I see both sides of this one. Getting effective performance out of a lock-free library requires deep knowledge of memory models. But a programmer who doesn't have that knowledge can still benefit from the correctness advantages.Ancona

Even though lock-free threading may be difficult in .NET, you can often make significant improvements when using a lock by studying exactly what needs to be locked and minimizing the locked section... this is also known as reducing the lock granularity.

As an example, say you need to make a collection thread-safe. Don't just blindly throw a lock around a method iterating over the collection if it performs some CPU-intensive task on each item. You might only need to put a lock around creating a shallow copy of the collection. Iterating over the copy can then work without a lock. Of course this is highly dependent on the specifics of your code, but I have been able to fix a lock convoy issue with this approach.
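
A hypothetical sketch of that copy-then-iterate approach (type names invented for illustration):

    using System.Collections.Generic;

    // Hold the lock only long enough to take a shallow copy,
    // then do the expensive per-item work outside the lock.
    class Worker
    {
        private readonly object _sync = new object();
        private readonly List<Item> _items = new List<Item>();

        public void ProcessAll()
        {
            List<Item> snapshot;
            lock (_sync)
            {
                snapshot = new List<Item>(_items);   // cheap: copies references only
            }

            foreach (var item in snapshot)           // CPU-intensive work, no lock held
            {
                item.DoCpuIntensiveWork();
            }
        }
    }

    class Item
    {
        public void DoCpuIntensiveWork() { /* ... */ }
    }

The lock is held only for the duration of the copy, so other threads are blocked for the cost of a list copy rather than for the whole CPU-intensive pass.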

Ichthyo answered 19/4, 2012 at 7:18 Comment(0)
