atomic_compare_exchange vs mutex

Asked 20/9, 2012 at 15:25 Answered 20/9, 2012 at 15:42

Solved c++multithreading mutex atomic lock-free

What is the point in replacing a mutex lock with block like this

void stack_push(stack* s, node* n)
{
    node* head;
    do
    {
        head = s->head;
        n->next = head;
    }
    while ( ! atomic_compare_exchange(s->head, head, n));
}

Can't understand what benefit we can get by replacing mutex with this atomic excange?

Blenny answered 20/9, 2012 at 15:25 Comment(11)

I added a language tag. Please correct if not right. – Marcy 20/9, 2012 at 15:27

The point is that the present code doesn't have locks. – Sammysamoan 20/9, 2012 at 15:37

@KerrekSB: ... but it has a loop instead? Why is a lock generally assumed to be slower? – Milden 20/9, 2012 at 15:41

@Mehrdad: So what? At least one thread is going to break out of the loop, guaranteed. – Sammysamoan 20/9, 2012 at 15:42

@KerrekSB: And the same is not true with a lock? – Milden 20/9, 2012 at 15:44

You might want a read memory barrier before the load of s->head. Not strictly necessary but I suspect it improves performance. – Wilbertwilborn 20/9, 2012 at 15:44

@Mehrdad: No. Everyone has to wait for the thread that holds the lock, which may be delayed indefinitely (or have died). – Sammysamoan 20/9, 2012 at 15:45

@KerrekSB: A thread having died is a moot point -- correctly written code shouldn't do that. And regarding everyone waiting for a lock: is there really an upper time bound for how long a CAS can take? Can't that also take a relatively long time in certain cases? – Milden 20/9, 2012 at 15:48

CAS latency is long as instructions go, but it's peanuts compared to the work the OS has to do to run a mutex. We're talking a single very very slow instruction, verses tons of code and functions calls and a context switch into the kernel. – Wilbertwilborn 20/9, 2012 at 15:52

You probably really need: atomic_compare_exchange(&s->head, &head, n) – Austin 20/9, 2012 at 16:4

just read Anthony Williams "C++ concurrency in action". a mutex is usually implemented by an atomic boolean flag, but with other atomic types you can do more and avoid the locking. In high-contention situations, mutexes can be critically slow. – Pitchblack 26/9, 2012 at 12:22

It is typically faster than a mutex. That being said, you cannot just simply replace all mutexes with a CAS. A single CAS will swap one reference with another safely among many threads.

If you have a compound function in which one write depends on another read (for example), you would need a mutex to ensure atomicity.

Rosabella answered 20/9, 2012 at 15:37 Comment(1)

Couldn't you use a CAS with memory_order_acq_rel to emulate a lock & have the correct memory ordering? – Ranjiv 23/11, 2022 at 2:34

There are a number of advantages;

it's a lot quicker (on Windows, like 10x or 100x - not so much on Linux, like 10% better)
it scales MUCH better (although still not enough - only to about 100 logical cores)
it's MUCH cooler and you seem far more intelligent and capable
where no waits or sleeps are required, this code can be used in places where waits or sleeps are forbidden, e.g. interrupt handlers, certain parts of the Windows (DISPATCH_LEVEL) and Linux kernels, etc

Wilbertwilborn answered 20/9, 2012 at 15:42 Comment(9)

Btw why is it slower on Linux? – Milden 20/9, 2012 at 16:26

It isn't. What differs is how quick the mutex is. On Windows, mutexes are slooooow. On Linux, they're damn quick (I think they must be user-mode rather than kernel-mode). So on Windows, the lock-free method is a LOT quicker. On Linux, it's only a bit quicker. – Wilbertwilborn 20/9, 2012 at 16:37

Ahh interesting. Kinda curious how you could have a user-mode-only mutex though... seems like you'd need to cross process boundaries. – Milden 20/9, 2012 at 16:38

@mehrdad: I believe linux's mutexes are built upon a futex, which is an atomic integer shared between processes, and a wait queue in the kernel. As implemented, it only has to go into the kernel when collisions occur. en.wikipedia.org/wiki/Futex – Cinchonism 20/9, 2012 at 17:6

@DaveS: Oh, but that's only when there's no contention! Isn't that what Windows does too? This page says, "EnterCriticalSection will not enter the kernel unless there is contention on the lock. If there is no contention on the lock, then the API will obtain the lock in the user space and return without entering privileged mode. If there is contention, then it will follow very similar paths as WaitForSingleObject within the kernel." How is Linux faster? – Milden 20/9, 2012 at 17:27

@Mehrdad: I would recommend testing both implementations then. I'm wondering if the concept that 'Windows mutexes are slow' are based on an earlier implementation, or one of those things that might have been true before, but aren't true now. – Cinchonism 20/9, 2012 at 17:38

Windows has something which really is called a mutex and I meant that. A critical section in Windows is a different entity. – Wilbertwilborn 20/9, 2012 at 18:2

You forgot the negative points: mutexes are a lot simpler to code and to get right. Lock-free coding is a mine field and so easy to get wrong (in particular if you want to really squeeze the best performance out of it by not using sequentially consistent memory order). What's most annoying in multi-threading: that you cannot validate the code by running simple tests, because a multi-threaded code is not deterministic (even serial code isn't, but it's much closer). – Pitchblack 26/9, 2012 at 12:30

@Walter, That's why you have to test your code. The only issue might be if it's incorrect, but the failure happens intermittently with a very low likelihood, such that you need lots of repeats of the tests to catch it. But the same issue can happen with mutexes in that there is a deadlock scenario with low likelihood but that can still happen. – Washing 23/3 at 21:14

It is typically faster than a mutex. That being said, you cannot just simply replace all mutexes with a CAS. A single CAS will swap one reference with another safely among many threads.

If you have a compound function in which one write depends on another read (for example), you would need a mutex to ensure atomicity.

Rosabella answered 20/9, 2012 at 15:37 Comment(1)

Couldn't you use a CAS with memory_order_acq_rel to emulate a lock & have the correct memory ordering? – Ranjiv 23/11, 2022 at 2:34

Recommended topics

Hot tags