The example implementation Wikipedia provides for a spinlock using the x86 XCHG instruction is:
; Intel syntax

locked:                      ; The lock variable. 1 = locked, 0 = unlocked.
    dd      0

spin_lock:
    mov     eax, 1           ; Set the EAX register to 1.
    xchg    eax, [locked]    ; Atomically swap the EAX register with
                             ; the lock variable.
                             ; This will always store 1 to the lock, leaving
                             ; the previous value in the EAX register.
    test    eax, eax         ; Test EAX with itself. Among other things, this will
                             ; set the processor's Zero Flag if EAX is 0.
                             ; If EAX is 0, then the lock was unlocked and
                             ; we just locked it.
                             ; Otherwise, EAX is 1 and we didn't acquire the lock.
    jnz     spin_lock        ; Jump back to the MOV instruction if the Zero Flag is
                             ; not set; the lock was previously locked, and so
                             ; we need to spin until it becomes unlocked.
    ret                      ; The lock has been acquired, return to the calling
                             ; function.

spin_unlock:
    mov     eax, 0           ; Set the EAX register to 0.
    xchg    eax, [locked]    ; Atomically swap the EAX register with
                             ; the lock variable.
    ret                      ; The lock has been released.
(from https://en.wikipedia.org/wiki/Spinlock#Example_implementation)
What I don't understand is why the unlock would need to be atomic. What's wrong with:

spin_unlock:
    mov     dword [locked], 0
    ret

A plain mov should work, especially given that only the least significant bit of the variable is used. – Gravitate
Maybe to give spin_unlock a return value, 1 for success and 0 for an error because the lock wasn't held. – Trample
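For what it's worth, the xchg-based unlock gets that almost for free, since the swapped-out old value is left in EAX. A minimal sketch of what such a checked unlock might look like (the label is mine, not from the thread):

spin_unlock_checked:
    xor     eax, eax         ; Value to store: 0 = unlocked.
    xchg    eax, [locked]    ; Atomically release the lock; EAX now holds
                             ; the previous value of the lock variable.
    ret                      ; EAX = 1: the lock was held, unlock succeeded.
                             ; EAX = 0: the lock was not held (caller error).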
Locked atomics (including the implicitly locked xchg) have total order on x86, while ordinary stores only have release consistency. Of course, release semantics are enough for a spinlock, provided the acquire is done with a locked atomic. – Rosalbarosalee
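To make that concrete, here is a minimal sketch (mine, not from the thread) of the variant that comment describes: acquire with the implicitly locked xchg, release with an ordinary store, which for an aligned dword on x86 is atomic and release-ordered. The locked2/spin_lock2/spin_unlock2 names are only there to keep it separate from the Wikipedia listing above.

locked2:                     ; Assumed dword-aligned; 1 = locked, 0 = unlocked.
    dd      0

spin_lock2:
    mov     eax, 1
    xchg    eax, [locked2]   ; Locked read-modify-write: acquires the lock and
                             ; acts as a full barrier on x86.
    test    eax, eax
    jnz     spin_lock2       ; Previous value was 1: someone else holds it.
    ret

spin_unlock2:
    mov     dword [locked2], 0   ; Plain aligned store: atomic on x86 and
                                 ; release-ordered, so no lock prefix is needed.
    ret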
Is xchg ideal, though? With contended locks it's much better to spin on just a load, and only try taking the lock if you see it become unlocked. Spinning on xchg will potentially delay the unlocker's xchg from happening. If you don't write to the lock at all while it's locked, the core that owns the lock will still own the cache line when it tries to unlock, right? – Cyaneous
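A sketch of that test-and-test-and-set idea (the label names and the pause hint are mine, not from the comment): waiters spin on a read-only check and only go back to the locked xchg once the lock looks free, so they don't keep pulling the cache line away from the owner.

spin_lock_ttas:
    mov     eax, 1
    xchg    eax, [locked]        ; One locked attempt up front.
    test    eax, eax
    jz      ttas_acquired        ; Previous value was 0: we got the lock.
ttas_wait:
    pause                        ; Spin-wait hint for the CPU.
    cmp     dword [locked], 0    ; Read-only check: no locked write while waiting,
    jne     ttas_wait            ; so the owner keeps the line when it unlocks.
    jmp     spin_lock_ttas       ; Looks free: retry the locked xchg.
ttas_acquired:
    ret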
A locked load is always followed by a locked store as far as externally-visible behaviour. But I think that optimization would be possible, as long as the memory-barrier effect still happened (e.g. MFENCE does it without a locked bus cycle). Might be worth testing with an experiment if you have the time. Can two threads run at un-contended speed running xchg [mem],eax when [mem]=eax? – Cyaneous
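A possible per-thread inner loop for that experiment is sketched below; the routine name, iteration count, and register choices are mine, and thread creation, core pinning, and timing would have to come from whatever harness calls it. Each thread hammers xchg on the already-set lock word with the same value, so the stored value never changes; timing one thread alone versus two threads in parallel shows whether the locked operations still slow each other down.

xchg_same_value:                 ; Assumes [locked] has already been set to 1.
    mov     eax, 1               ; Same value the location already holds.
    mov     ecx, 100000000       ; Arbitrary iteration count for timing.
xchg_same_value_loop:
    xchg    eax, [locked]        ; Locked RMW that never changes the stored value.
    dec     ecx
    jnz     xchg_same_value_loop
    ret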