Relative performance of swap vs compare-and-swap locks on x86

Two common locking idioms are:

if (!atomic_swap(lockaddr, 1)) /* got the lock */

and:

if (!atomic_compare_and_swap(lockaddr, 0, val)) /* got the lock */

where val could simply be a constant or an identifier for the new prospective owner of the lock.

What I'd like to know is whether there tends to be any significant performance difference between the two on x86 (and x86_64) machines. I know this is a fairly broad question since the answer might vary a lot between individual cpu models, but that's part of the reason I'm asking SO rather than just doing benchmarks on a few cpus I have access to.

Seism asked 17/3, 2011 at 13:37 Comment(3)
+1 just for braving the "profile it!" and "don't even think about it unless it's a proven bottleneck!" comments :-)Frontispiece
Just to make it worse, I'd not be surprised if the level of contention and the nature of the parallelism available (one multi-core processor vs. multiple processors) were also important factors.Burstone
If the fast-path fails to get the lock, your spin-loop should check read-only before retrying xchg or cmpxchg, to avoid having all the waiters hammering on the cache line and delaying the thread trying to unlock. (Use _mm_pause() and atomic_load_explicit(lockaddr, memory_order_relaxed) in the spinloop. Avoid having _mm_pause() in the fast path). https://mcmap.net/q/18378/-locks-around-memory-manipulation-via-inline-assembly and https://mcmap.net/q/18379/-fastest-inline-assembly-spinlock.Polypropylene

I assume atomic_swap(lockaddr, 1) gets translated to an xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg (or cmpxchg8b/cmpxchg16b for wider operands).

Some Linux kernel developers think cmpxchg is faster, because the lock prefix isn't implied as it is with xchg. So if you are on a uniprocessor (even multithreaded), or can otherwise make sure the lock prefix isn't needed, you are probably better off with cmpxchg.

But chances are your compiler will translate it to a "lock cmpxchg", and in that case it doesn't really matter. Also note that while the latencies of these instructions are low (1 cycle without lock and about 20 with lock), if you happen to share a common sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, and those last forever compared to the instruction latencies. The instruction latencies will most likely be completely hidden by a 200- or 500-cycle-long cache snoop/sync/memory access/bus lock/whatever.
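
For concreteness, here is a minimal C11 sketch of the two idioms (the helper names are hypothetical, not the asker's exact code); with gcc or clang on x86-64, the first typically compiles to xchg (the lock is implicit) and the second to lock cmpxchg:

#include <stdatomic.h>
#include <stdbool.h>

static bool try_lock_xchg(atomic_int *lockaddr) {
    return atomic_exchange(lockaddr, 1) == 0;   /* got the lock if it was 0 */
}

static bool try_lock_cas(atomic_int *lockaddr, int val) {
    int expected = 0;                           /* succeeds only if the lock word is 0 */
    return atomic_compare_exchange_strong(lockaddr, &expected, val);
}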

Marylou answered 17/3, 2011 at 15:11 Comment(1)
+1 and thanks for the link. I want to accept an answer but I'm not sure which one to go with - neither really covers the topic completely but both have good info.Seism

I found this Intel document, stating that there is no difference in practice:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.

Aeroplane answered 17/3, 2011 at 15:28 Comment(3)
+1 for the link. That myth makes no sense though. I would naively assume the xchg would be cheaper, just because (1) it's less powerful, and (2) cmpxchg has "more to do" while the bus lock (or whatever mechanism) is held.Seism
The myth makes perfect sense. xchg always locks the bus, so it's normally slower than cmpxchg without the lock prefix. But for atomic operations you'll need to use "lock cmpxchg", which is just as slow as "lock xchg". And there isn't really more for the CPU to do: both are "load, modify, store"; the slow part is the load and the store, and the modification (swap or compare-and-swap) takes essentially no time in the CPU. In fact, cmpxchg could be faster if it fails the comparison, e.g. it wouldn't need to store.Miter
@Martin: lock cmpxchg could be faster if it worked that way, but it doesn't. Failed CAS still writes the old value. (Does cmpxchg write destination cache line on failure? yes). Historically that was so external devices would only ever see the #LOCK external signal asserted across a read + write, and Pentium's new lock cmpxchg instruction (and the undocumented 486 version) didn't want to break that. Now it's maybe just for CPU internals, except maybe on cache-line-split where there's still a bus lockPolypropylene

On x86, any instruction with a LOCK prefix does all memory operations as read-modify-write cycles. This means that XCHG (with its implicit LOCK) and LOCK CMPXCHG (in all cases, even if the comparison fails) always get an exclusive lock on the cache line. The result is that there is basically no difference in performance.

Note that many CPUs all spinning on the same lock can cause a lot of bus overhead in this model. This is one reason that spin-lock loops should contain PAUSE instructions. Some other architectures have better operations for this.
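
As a sketch of that advice in C11 (assuming the _mm_pause intrinsic is available; this is an illustration, not a production lock), a test-and-test-and-set spin lock with PAUSE in the read-only wait loop might look like:

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause */

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* fast path: a single xchg attempt, no pause */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire)) {
        /* contended: spin read-only so waiters don't steal the cache
           line from the owner; pause reduces power and helps a
           sibling hyperthread */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            _mm_pause();
    }
}

static void spin_unlock(spinlock_t *l) {
    /* a plain release store is enough to unlock on x86 */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}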

Transfuse answered 17/8, 2012 at 7:13 Comment(1)
The spin loop that retries if the fast-path finds the lock already locked should use a read-only check (separated by PAUSE instructions) before trying an xchg. Putting pause inside a loop that keeps hammering away at the cache line with xchg is only a minor improvement (especially on Intel pre-Skylake where pause only sleeps for ~5 cycles instead of ~100). Note that Skylake makes it important not to pause in the no-contention fast-path, so use a separate loop.Polypropylene

Are you sure you didn't mean

 if (!atomic_load(lockaddr)) {
     if (!atomic_swap(lockaddr, val)) {
         /* got the lock */
     }
 }

for the second one?

Test and test and set locks (see Wikipedia https://en.wikipedia.org/wiki/Test_and_test-and-set ) are a quite common optimization for many platforms.

Depending on how compare-and-exchange is implemented, it could be faster or slower than test and test-and-set.

As x86 is a relatively strongly ordered platform, hardware optimizations that can make test and test-and-set locks faster may be less applicable.

Figure 8 from the document that Bo Persson found http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/ shows that test and test-and-set locks are superior in performance.

Kraken answered 24/4, 2017 at 21:44 Comment(2)
Yup, spinning on a read-only check is good. Many cores hammering the same cache line with xchg will delay the unlocking thread. Even better is having PAUSE (_mm_pause) in the spin loop (but not in the fast-path).Polypropylene
He might not have. The old 386 version of the lock was: you swapped a 1 in, but you only owned the lock if you got a zero back out.Wootan

Using xchg vs cmpxchg to acquire the lock

In terms of performance on Intel processors, the two are the same, but, for the sake of simplicity and ease of understanding, I prefer the first way from the examples that you have given. There is no reason to use cmpxchg to acquire a lock if you can do it with xchg.

According to Occam's razor, simple things are better.

Besides that, locking with xchg is more powerful: since you get the old value back, you can also check the correctness of the logic of your software, i.e. that you are not accessing a memory byte that was never explicitly initialized for locking. Thus you can verify that you are using a correctly initialized synchronization variable, and you can also detect a double unlock.

Using normal memory store to release the lock

There is no consensus on whether the write to the synchronization variable that releases the lock should be just a normal memory store (mov) or a bus-locking memory store, i.e. an instruction with an implicit or explicit lock prefix, like xchg.

The approach of using a normal memory store to release the lock was recommended by Peter Cordes; see the comments below for details.

But there are implementations where both acquiring and releasing the lock are done with a bus-locking memory store, since this approach seems straightforward and intuitive. For example, LeaveCriticalSection under Windows 10 uses a bus-locking store to release the lock even on a single-socket processor; on multiple physical processors with non-uniform memory access (NUMA), this question is even more important.

I've done micro-benchmarking on a memory manager that does lots of memory allocate/reallocate/free, on a single-socket CPU (Kaby Lake). When there is no contention, i.e. there are fewer threads than physical cores, the tests complete about 10% slower with a locked release; but when there are more threads than physical cores, tests with a locked release complete 2% faster. So, on average, a normal memory store to release the lock outperforms a locked memory store.
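
As a minimal C11 sketch of the two release strategies being compared (illustrative, not the benchmarked code):

#include <stdatomic.h>

/* On x86, the release store compiles to a plain mov; the exchange
   compiles to xchg, an implicitly locked full memory barrier. */
static void unlock_plain_store(atomic_int *lockaddr) {
    atomic_store_explicit(lockaddr, 0, memory_order_release);
}

static void unlock_locked_rmw(atomic_int *lockaddr) {
    (void)atomic_exchange(lockaddr, 0);   /* the old value could be checked to catch double unlocks */
}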

Example of locking that checks the synchronization variable for validity

See this example (in the Delphi programming language) of safer locking functions that check the data of a synchronization variable for validity and catch attempts to release locks that were not acquired:

const
  cLockByteAvailable = 107; // arbitrary constants, use any unique values that you like; I've chosen prime numbers
  cLockByteLocked    = 109;
  cLockByteFinished  = 113;

function AcquireLock(var Target: LONG): Boolean;
var
  R: LONG;
begin
  R := InterlockedExchange(Target, cLockByteLocked);
  case R of
    cLockByteAvailable:
      Result := True;  // the lock was available, so we have acquired it
    cLockByteLocked:
      Result := False; // the lock was already held by someone else, so we have failed to acquire it this time
  else
    raise Exception.Create('Serious application error - tried to acquire lock using a variable that has not been properly initialized');
  end;
end;

procedure ReleaseLock(var Target: LONG);
var
  R: LONG;
begin
  // As Peter Cordes pointed out (see the comments below), releasing the lock doesn't have to be interlocked - a normal store suffices. Even for debugging we use a normal load. However, Windows 10 uses a locked release in LeaveCriticalSection.
  R := Target;
  Target := cLockByteAvailable;
  if R <> cLockByteLocked then
  begin
    raise Exception.Create('Serious application error - tried to release a lock that has not been actually locked');
  end;
end;

Your main application goes here:

var
  AreaLocked: LONG;
begin
  AreaLocked := cLockByteAvailable; // on program initialization, fill in the default value

  .... 
 
 if AcquireLock(AreaLocked) then
 try
   // do something critical with the locked area
   ... 

 finally
   ReleaseLock(AreaLocked); 
 end;

....

  AreaLocked := cLockByteFinished; // on program termination, set a special value to catch probable cases where somebody still tries to acquire the lock

end.

Efficient pause-based spin-wait loops

Test, test-and-set

You may also use the following assembly code (see the "Assembly code example of pause-based spin-wait loop" section below) as a working example of the "pause"-based spin-wait loop.

This code uses a normal memory load while spinning, to save resources, as suggested by Peter Cordes. This technique is called "test, test-and-set". You can find out more about this technique at https://mcmap.net/q/15058/-how-does-x86-pause-instruction-work-in-spinlock-and-can-it-be-used-in-other-scenarios

Number of iterations

The pause-based spin-wait loop in this example first tries to acquire the lock by reading the synchronization variable, and, if it is not available, executes the pause instruction in a loop of 5000 iterations. After 5000 iterations it calls the Windows API function SwitchToThread(). The value of 5000 iterations is empirical, based on my tests. Values from 500 to 50000 also seem to be OK; in some scenarios lower values are better, in others higher values are. You can read more on pause-based spin-wait loops at the URL given in the preceding paragraph.
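
A minimal C rendering of that wait policy (illustrative; 5000 is the empirical bound from above, 109 mirrors the cLockByteLocked constant, and SwitchToThread is the Win32 yield call):

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause */
#include <windows.h>     /* SwitchToThread */

static void wait_until_lock_looks_free(atomic_int *lockaddr) {
    for (;;) {
        for (int i = 0; i < 5000; ++i) {
            /* read-only check; the caller retries the interlocked exchange afterwards */
            if (atomic_load_explicit(lockaddr, memory_order_relaxed) != 109 /* cLockByteLocked */)
                return;
            _mm_pause();
        }
        SwitchToThread();   /* yield the rest of the timeslice to another ready thread */
    }
}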

Availability of the pause instruction

Please note that you should only use this code on processors that support SSE2 - check the corresponding CPUID bit before executing the pause instruction - otherwise it will just waste power. On processors without pause, use other means, like EnterCriticalSection/LeaveCriticalSection, or Sleep(0) and then Sleep(1) in a loop. Some people say that on 64-bit processors you don't need to check for SSE2, because the original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions; practically, if you run 64-bit code, you already have SSE2 for sure and thus the pause instruction. However, Intel discourages relying on the presence of a specific feature, and explicitly states that certain features may vanish in future processors and that applications must always check features via CPUID. On the other hand, the SSE instructions became ubiquitous and many 64-bit compilers use them without checking (e.g. Delphi for Win64), so the chances that some future processor will lack SSE2, let alone pause, are very slim.
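
A sketch of that CPUID check using gcc/clang's <cpuid.h> (SSE2 is reported in CPUID leaf 1, EDX bit 26; note that on CPUs without SSE2, pause simply decodes as rep nop, so skipping the check only costs a harmless no-op there):

#include <cpuid.h>   /* __get_cpuid (gcc/clang) */

static int pause_supported(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 26) & 1;   /* SSE2 feature bit */
}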

Assembly code example of pause-based spin-wait loop

// on entry rcx = address of the byte-lock
// on exit: al (eax) = old value of the byte at [rcx]
@Init:
   mov  edx, cLockByteLocked
   mov  r9d, 5000
   mov  eax, edx
   jmp  @FirstCompare
@DidntLock:
@NormalLoadLoop:
   dec  r9
   jz   @SwitchToThread // for static branch prediction, jump forward means "unlikely"
   pause
@FirstCompare:
   cmp  [rcx], al       // use a cheap normal load while waiting, to avoid stealing the cache line; only when the lock looks free do we retry the interlocked exchange
   je   @NormalLoadLoop // for static branch prediction, jump backwards means "likely"
   lock xchg [rcx], al
   cmp  eax, edx        // 32-bit comparison is faster on newer processors like Xeon Phi or Cannonlake.
   je   @DidntLock
   jmp  @Finish
@SwitchToThread:
   push  rcx
   call  SwitchToThreadIfSupported
   pop   rcx
   jmp  @Init
@Finish:
Minima answered 6/7, 2017 at 22:20 Comment(31)
You don't need xchg to release a lock. Just a release-store (i.e. a normal x86 store) is sufficient. Even for debugging, it's probably sufficient to do a normal read on the lock to see if it was already unlocked before storing cLockByteAvailable.Polypropylene
The important thing for performance is to spin on a read-only check to see if the lock is available, instead of spinning on InterlockedExchange. Many cores hammering the same cache line with xchg will delay the unlocking thread.Polypropylene
@PeterCordes - thank you, I will test this on real application and will let you know the difference in performance, and then will update the answer.Minima
@PeterCordes - do you mean that just normal read + pause is better for spinning?Minima
Yes. I wrote a simple asm spinlock example a while ago in another SO answer which uses a pause+load spin loop in the contended case. It avoids storing to the cache line at all until it sees that the lock is available. (A real implementation would want some kind of backoff to prevent a stampede of threads all trying to take the lock at once when it becomes available.)Polypropylene
@PeterCordes - Peter, do you know of an article that discusses the issue that releasing a lock doesn't need a locked write, just a normal write? What if, on two processors with NUMA, the second processor notices much later that the lock is available if we release it with just a normal write - it would be cached by a write-back cache on one processor and would not reach the second processor as quickly as with a locked write?Minima
It's possible that an xchg to release a lock might make it visible slightly sooner, but definitely with greater overhead in the releasing thread. A normal store can't stay invisible for long, though. NUMA machines are still cache-coherent, and what you describe (reading a stale value from L1 in one core while L1 in another core contains a different dirty value) is a violation of cache coherency. For the store to commit to L1D, all other copies of the cache line on all cores of all sockets must be Invalid. Multi-socket systems snoop the other socket when they miss in their own L3.Polypropylene
glibc pthreads uses a normal x86 store in pthread_spin_unlock. A CPU does try to commit stores to L1D as soon as possible, to free up space in the store buffer. I think all you'd get from using xchg is blocking execution of unrelated stores/loads in code after the critical section until the store becomes globally visible. But maybe xchg could make the unlocking thread refuse to release the cache line if another thread tried to read the lock while xchg was runningPolypropylene
I think that possible benefit isn't worth the overhead of a full memory barrier (and the other costs of xchg, including just being a lot more uops) in the unlock path, since like I said, it can't be delayed for long. Especially if the threads waiting to take the lock are spinning on a load, not an xchg.Polypropylene
@PeterCordes - thank you for the explanation, Peter!Minima
@PeterCordes - I have figured out that CriticalSection in Windows 10 uses lock cmpxchg to release a critical section, not just a normal store.Minima
But it's not just a spinlock. Does it need that to synchronize the spin part with the OS-assisted part? Or do you think it uses it where it could have used a normal store? Or is it using it to detect double-unlock?Polypropylene
@PeterCordes - it uses locked store where it could have used a normal store. If you just write EnterCriticalSection and then LeaveCritical section, just for test, and there will be no threads also entering this section in between, the release will be a locked store. It doesn't detect double unlock.Minima
@PeterCordes - I've done micro-benchmarking on a memory manager that does lots of memory allocate/reallocate/free, on a single-socket CPU (Kaby Lake). When there is no contention, i.e. there are fewer threads than physical cores, with locked release the tests complete about 10% slower, but when there are more threads than physical cores, tests with locked release complete 2% faster. I will have to do tests on a computer with multiple sockets, maybe it will be more than 2% faster.Minima
Interesting. I'd guess it depends on the surrounding code as well as the hardware + contention level. (A memory barrier would have more impact if it prevented out-of-order execution from getting started on stuff after the release by blocking a load). In your test, were the threads waiting to take the lock spinning on a read-only load + pause, or are they spinning on xchg?Polypropylene
@PeterCordes - the code I measured turned out to be irrelevant - it was a wrapper around EnterCriticalSection. When I wrote other code, without any critical section, just with pause and SwitchToThread(), the normal store to release the lock was faster in all cases. I have updated the answer to show the usage example of locking; the unlocking with a normal store is always faster.Minima
You don't need to check CPUID before using pause. On old CPUs, it runs as a NOP. Having a separate version of the spin-loop that leaves out the rep nop will make such a tiny difference in power consumption that it's not worth bothering.Polypropylene
It looks like your function runs pause before checking the lock even the first time through. On Skylake, pause sleeps for ~100 cycles, up from ~5, so you definitely want to avoid running it in the fast-path through your function to make it cheap to acquire a lock in the no-contention case. See my asm implementation here, and note that the spinloop is a separate block that doesn't run unless we found the lock already locked.Polypropylene
@PeterCordes - I had code that did lock xchg at the start, and if it didn't acquire the lock, it went to the code above, which started with a pause. So it did lock xchg at the start. I think that your approach of first doing a normal load is more reasonable (I have added the jmp to @FirstCompare to address the issue that you have raised).Minima
Now you have a taken-branch on the fast-path. It's avoidable at the minor cost of maybe some code duplication. Also, what happens if the byte in memory isn't cLockByteLocked or 0? It looks to me like you will eventually lock with whatever you found there instead of cLockByteLocked, if one of your xchg attempts fails and leaves that byte in al. Also, why are you using 64-bit r9 and r10 instead of 32-bit r10d? Some assemblers will assemble mov r10, 1 to a 7-byte mov r/m64, imm32. Or avoiding REX prefixes altogether with esi / edi. Also, inc would be as fast therePolypropylene
@PeterCordes - I've done tests: lock xchg straight away vs "normal load" first. The results are mixed. In some scenarios, straight lock xchg without any normal load before it is faster (the whole application completes faster), while in others, normal load first completes the application faster. There is no clear answer on which is faster. Probably, normal load is sometimes slower due to branch prediction failures or due to slower cache updates.Minima
@PeterCordes - thank you, I have updated the code. The function returns existing byte in AL, so the higher level code will check it and raise an exception.Minima
Interesting, thanks for testing. Normal load before xchg could also be slower if it causes a memory-ordering mis-speculation pipeline clear. (There's a performance counter event for that, maybe worth looking at.)Polypropylene
Also, I'd suggest you remove the suggestion to check CPUID. An extra NOP on CPUs that don't support pause makes no difference, and in that case you just have no option but to spin. (Such CPUs are so rare that they're barely worth worrying about.) If you're going to warn about anything for old CPUs, warn that cmp eax, edx will cause a partial-register stall on Nehalem and earlier!! (On SnB, it just inserts a merging uop without stalling, and on HSW and later it merges for free.) Workaround: xor eax,eax / mov al, cLockByteLocked so the reg will have its upper-bytes-zero flag set.Polypropylene
@PeterCordes - maybe we should remove that edx altogether and use cmp al, cLockByteLocked instead of cmp eax, edx?Minima
yup, that's what I'd recommend. It's still a 2-byte instruction because of the special cmp al, imm8 opcode.Polypropylene
"Intel discourages relying on the presence of a specific feature" - you keep going on about this. On a CPU that doesn't support pause, it decodes as REP NOP, which is guaranteed to run as NOP. People did use spinlocks before pause existed, so it's fine, especially if you fall back to an OS-supported locking method after 5k spins like you're doing. (Although really, using an OS-supported lock in the first place instead of rolling your own is probably the best plan, if you pick one that tries in user-space first before making a system call.)Polypropylene
@PeterCordes - thank you! I'm working on a FastMM4 memory manager fork - github.com/maximmasiutin/FastMM4 - benchmarks show that it is now almost twice faster when there are multiple threads, especially when there are more threads than physical cores.Minima
@PeterCordes - I think it's better to either just use EnterCriticalSection/LeaveCriticalSection from the start, or use a spin-lock. I don't like the idea of reverting to EnterCriticalSection after 5000 loops. My testing has shown that it is very slow. Surprisingly, spin-loops with Sleep(0) (just one time) and then Sleep(1) show good results when the number of threads is very high - the application completes quickly. I just don't like the idea of spinning with NOP if the CPU doesn't support PAUSE. Although, the branching related to SSE2 (to check whether pause is supported) may not be worth its price.Minima
All CPUs with hyperthreading support pause, so on a CPU without pause you're just using a bit more power. All such CPUs are so old that they use a lot of power all the time anyway. With or without pause, you're still occupying a CPU core and not letting anything else run. If any thread ever holds the lock for very long, it's bad to keep spinning. Especially if you have more threads than logical cores (or some other load), since you could use up a whole timeslice spinning while the thread holding the lock is waiting for a CPU. Exponential randomized backoff is supposed to be good tooPolypropylene
I agree that falling back to EnterCriticalSection after 5k spins is probably not good. preshing.com/20111124/always-use-a-lightweight-mutex says that EnterCriticalSection tries to lock purely in user-space before falling back to OS-supported sleep / notify-wakeup if it doesn't get the lock quickly. In Linux, pthreads locks are similar, I think. They try a user-space spinlock before falling back to a Linux futex.Polypropylene
