Using xchg vs cmpxchg to aquire the lock
In terms of performance on Intel's processors, it is the same, but for the sake of simplicity, to have things easier to fathom, I prefer the first way from the examples that you have given. There is no reason to use cmpxchg
for acquiring a lock if you can do this with xchg
.
According to the Occam's razor principle, simple things are better.
Besides that, locking with xchg
is more powerful - you can also check the correctness of the logic of your software, i.e. that you are not accessing the memory byte that has not been explicitly allocated for locking. Thus, you will check that you are using the correctly initialized synchronization variable. Besides that, you will be able to check that you don't unlock twice.
Using normal memory store to release the lock
There is no consensus on whether writing to the synchronization variable on releasing the lock should be done with just a normal memory store (mov
) or a bus-locking memory store, i.e. an instruction with implicit or explicit lock
-prefix, like xchg
.
The approach of using normal memory store to release lock was recommended by Peter Cordes, see the comments below for details.
But there are implementations where both acquiring and releasing the lock is done with a bus-locking memory store, since this approach seems to be straightforward and intuitive. For example, LeaveCriticalSection under Windows 10 uses bus-locking store to release the lock even on a single-socket processor; while on multiple physical processors with Non-Uniform-Memory-Access (NUMA), this issue is even more important.
I've done a micro-benchmarking on a memory manager that does lots of memory allocate/reallocate/free, on a single-socket CPU (Kaby Lake). When there is no contention, i.e. there are fewer threads than the physical cores, with locked release the tests complete about 10% slower, but when there are more threads when physical cores, tests with locked release complete 2% faster. So, on average, normal memory store to release lock outperforms locked memory store.
Example of locking that checks synchronization variable for validity
See this example (Delphi programming language) of safer locking functions that checks data of a synchronization variable for validity, and catches attempts to release locks that were not acquired:
const
cLockAvailable = 107; // arbitrary constant, use any unique values that you like, I've chosen prime numbers
cLockLocked = 109;
cLockFinished = 113;
function AcquireLock(var Target: LONG): Boolean;
var
R: LONG;
begin
R := InterlockedExchange(Target, cLockByteLocked);
case R of
cLockAvailable: Result := True; // we've got a value that indicates that the lock was available, so return True to the caller indicating that we have acquired the lock
cLockByteLocked: Result := False; // we've got a value that indicates that the lock was already acquire by someone else, so return False to the caller indicating that we have failed to acquire the lock this time
else
begin
raise Exception.Create('Serious application error - tried to acquire lock using a variable that has not been properly initialized');
end;
end;
end;
procedure ReleaseLock(var Target: LONG);
var
R: LONG;
begin
// As Peter Cordes pointed out (see comments below), releasing the lock doesn't have to be interlocked, just a normal store. Even for debugging we use normal load. However, Windows 10 uses locked release on LeaveCriticalSection.
R := Target;
Target := cLockAvailable;
if R <> cLockByteLocked then
begin
raise Exception.Create('Serious application error - tried to release a lock that has not been actually locked');
end;
end;
Your main application goes here:
var
AreaLocked: LONG;
begin
AreaLocked := cLockAvailable; // on program initialization, fill the default value
....
if AcquireLock(AreaLocked) then
try
// do something critical with the locked area
...
finally
ReleaseLock(AreaLocked);
end;
....
AreaLocked := cLockFinished; // on program termination, set the special value to catch probable cases when somebody will try to acquire the lock
end.
Efficient pause-based spin-wait loops
Test, test-and-set
You may also use the following assembly code (see the "Assembly code example of pause-based spin-wait loop" section below) as a working example of the "pause"-based spin-wait loop.
This code it uses normal memory load while spinning to save resources, as suggested by Peter Cordes. This technique is called "test, test-and-set". You can find out more on this technique at https://mcmap.net/q/15058/-how-does-x86-pause-instruction-work-in-spinlock-and-can-it-be-used-in-other-scenarios
Number of iterations
The pause-based spin-wait loop in this example first tries to acquire the lock by reading the synchronization variable, and if it is not available, utilize pause
instruction in a loop of 5000 cycles. After 5000 cycles it calls Windows API function SwitchToThread(). This value of 5000 cycles is empirical. It is based on my tests. Values from 500 to 50000 also seem to be OK, but in some scenarios lower values are better while in other scenarios higher values are better. You can read more on pause-based spin-wait loops at the URL that I gave in the preceding paragraph.
Availability of the pause instruction
Please note that you may use this code only on processors that support SSE2 - you should check the corresponding CPUID bit before calling pause
instruction - otherwise there will just be a waste of power. On processors without pause
just use other means, like EnterCriticalSection/LeaveCriticalSection or Sleep(0) and then Sleep(1) in a loop. Some people say that on 64-bit processors you may not check for SSE2 to make sure that the pause
instruction is implemented, because the original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions, and, practically, if you run 64-bit code, you have already have SSE2 for sure and thus the pause
instruction. However, Intel discourages a practice of relying on a presence specific feature and explicitly states that certain feature may vanish in future processors and applications must always check features via CPUID. However, the SSE instructions became ubiquitous and many 64-bit compilers use them without checking (e.g. Delphi for Win64), so chances that in some future processors there will be no SSE2, let alone pause
, are very slim.
Assembly code example of pause-based spin-wait loop
// on entry rcx = address of the byte-lock
// on exit: al (eax) = old value of the byte at [rcx]
@Init:
mov edx, cLockByteLocked
mov r9d, 5000
mov eax, edx
jmp @FirstCompare
@DidntLock:
@NormalLoadLoop:
dec r9
jz @SwitchToThread // for static branch prediction, jump forward means "unlikely"
pause
@FirstCompare:
cmp [rcx], al // we are using faster, normal load to not consume the resources and only after it is ready, do once again interlocked exchange
je @NormalLoadLoop // for static branch prediction, jump backwards means "likely"
lock xchg [rcx], al
cmp eax, edx // 32-bit comparison is faster on newer processors like Xeon Phi or Cannonlake.
je @DidntLock
jmp @Finish
@SwitchToThread:
push rcx
call SwitchToThreadIfSupported
pop rcx
jmp @Init
@Finish:
xchg
orcmpxchg
, to avoid having all the waiters hammering on the cache line and delaying the thread trying to unlock. (Use_mm_pause()
andatomic_load_explicit(lockaddr, memory_order_relaxed)
in the spinloop. Avoid having_mm_pause()
in the fast path). https://mcmap.net/q/18378/-locks-around-memory-manipulation-via-inline-assembly and https://mcmap.net/q/18379/-fastest-inline-assembly-spinlock. – Polypropylene