x86 spinlock using cmpxchg

I'm new to using gcc inline assembly, and was wondering if, on an x86 multi-core machine, a spinlock (without race conditions) could be implemented as (using AT&T syntax):

spin_lock:
mov 0 eax
lock cmpxchg 1 [lock_addr]
jnz spin_lock
ret

spin_unlock:
lock mov 0 [lock_addr]
ret
Aruwimi asked 4/8, 2011 at 2:15 Comment(0)

You have the right idea, but your asm is broken:

cmpxchg can't work with an immediate operand, only registers.

lock is not a valid prefix for mov. mov to an aligned address is atomic on x86, so you don't need lock anyway.

It has been some time since I've used AT&T syntax, hope I remembered everything:

spin_lock:
    xorl   %ecx, %ecx
    incl   %ecx            # newVal = 1
spin_lock_retry:
    xorl   %eax, %eax      # expected = 0
    lock; cmpxchgl %ecx, (lock_addr)
    jnz    spin_lock_retry
    ret

spin_unlock:
    movl   $0,  (lock_addr)    # atomic release-store
    ret

Note that GCC has atomic builtins, so you don't actually need to use inline asm to accomplish this:

void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1));
}

void spin_unlock(int volatile *p)
{
    asm volatile ("":::"memory"); // compiler barrier: keeps critical-section accesses from reordering after the store; x86 stores already have release semantics
    *p = 0;
}

As Bo says below, locked instructions incur a cost: every one you use must acquire exclusive access to the cache line, just like a normal store to it, but hold onto it for the duration of the lock cmpxchg. This can delay the unlocking thread, especially if multiple threads are waiting to take the lock. Even without many CPUs, it's still easy and worth it to optimize around:

#include <immintrin.h>  // for _mm_pause (xmmintrin.h on older compilers)

void spin_lock(int volatile *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1))
    {
        // spin read-only until a cmpxchg might succeed
        while(*p) _mm_pause();  // or maybe do{}while(*p) to pause first
    }
}

The pause instruction is vital for performance on HyperThreading CPUs when you've got code that spins like this -- it lets the second thread execute while the first thread is spinning. On CPUs which don't support pause, it is treated as a nop.
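
If your toolchain lacks _mm_pause, a minimal sketch (not part of the original answer; cpu_pause is a hypothetical helper name) is to emit the instruction directly. pause is encoded as rep; nop, which is why CPUs that predate it just treat it as a nop:

static inline void cpu_pause(void)      /* hypothetical helper */
{
    __asm__ __volatile__("pause");       /* encoded as rep; nop: harmless on old CPUs */
}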

pause also prevents memory-order mis-speculation when leaving the spin loop, when it's finally time to do real work again. See: What is the purpose of the "PAUSE" instruction in x86?

Note that spin locks are actually rarely used: typically, one uses something like a critical section or futex. These integrate a spin lock for performance under low contention, but then fall back to an OS-assisted sleep and notify mechanism. They may also take measures to improve fairness, and lots of other things the cmpxchg / pause loop doesn't do.
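
As a rough, hedged sketch of that idea (not part of the original answer; the spin count and the sched_yield fallback are arbitrary stand-ins for a real futex-based wait), a hybrid lock might look like:

#include <sched.h>        // sched_yield
#include <immintrin.h>    // _mm_pause

void hybrid_spin_lock(int volatile *p)
{
    while (!__sync_bool_compare_and_swap(p, 0, 1))
    {
        int spins = 0;
        while (*p)                  // wait read-only until the lock looks free
        {
            if (++spins < 1000)
                _mm_pause();        // short contention: keep spinning
            else
                sched_yield();      // long contention: crude stand-in for a futex wait
        }
    }
}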


Also note that cmpxchg is unnecessary for a simple spinlock: you can use xchg and then check whether the old value was 0 or not. Doing less work inside the locked instruction may keep the cache line pinned for less time. See Locks around memory manipulation via inline assembly for a complete asm implementation using xchg and pause (but still with no fallback to OS-assisted sleep, just spinning indefinitely.)
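
As a minimal sketch of the xchg approach (again not the answer's code): GCC's __sync_lock_test_and_set compiles to a plain xchg on x86, since xchg with a memory operand is implicitly locked:

#include <immintrin.h>    // _mm_pause

void spin_lock_xchg(int volatile *p)
{
    while (__sync_lock_test_and_set(p, 1))   // old value != 0 means the lock was held
        while (*p)
            _mm_pause();                     // spin read-only until it looks free
}

void spin_unlock_xchg(int volatile *p)
{
    __sync_lock_release(p);                  // release barrier + store 0
}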

Xyloid answered 4/8, 2011 at 2:36 Comment(6)
Should the parameter for void spin_lock() also be declared volatile? – Aruwimi
No. __sync_bool_compare_and_swap already treats it as volatile. – Xyloid
The asm used as a memory barrier inside spin_unlock should probably include a memory clobber. Though on the other hand, there is __sync_lock_release which is designed just to do the "write barrier, and write 0" thing without needing to think about asm at all, and it is even "somewhat portable". It doesn't explicitly work as a read barrier (it incidentally does on the target architecture), but that's OK. The worst that can happen is another thread doing a single extra spin in a rare, unlikely case. – Aldose
I think the actual spinlock should be implemented in as short a sequence as possible. Since we can lock when the lock value is 0 (we replace it with 1 and get the 0 back), a more natural sequence would be to call the lock spinlock_failed, which would be true when we get a 1 in return, i.e. the lock failed. Additional functionality can then be built around spinlock_failed with retries etc. – Kanara
Can you elaborate further with regard to your comment on the pause instruction, please? Should I include a new label that directs to a pause instruction, which in turn jumps to spin_lock_retry? – Barnet
You should spin on a pure load (with pause in the loop), not lock cmpxchg, like you do in C. Your C version should probably be a do{}while() loop to pause after cmpxchg failure, before the first try of a load. I'm not sure if that could cause a memory-order mis-speculation (which pause avoids), but if so it flushes the whole pipeline, affecting both hyperthreads, not just the one spinning. – Seamark
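
A hedged asm sketch of what that comment describes (not code from the answer): pause runs only in the contended path, and the read-only spin happens before retrying the locked cmpxchg:

spin_lock:
    movl    $1, %ecx               # newVal = 1
spin_lock_retry:
    xorl    %eax, %eax             # expected = 0
    lock cmpxchgl %ecx, (lock_addr)
    jz      spin_lock_done         # uncontended fast path: no pause executed
spin_wait:
    pause                          # spin-wait hint; avoids the mis-speculation flush
    cmpl    $0, (lock_addr)        # plain load, no bus lock
    jnz     spin_wait
    jmp     spin_lock_retry        # looks free: retry the locked cmpxchg
spin_lock_done:
    ret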

This will put less contention on the memory bus:

void spin_lock(int *p)
{
    while(!__sync_bool_compare_and_swap(p, 0, 1)) while(*p);
}
Femoral answered 16/10, 2012 at 21:22 Comment(2)
Agreed, though this code isn't so good. A simple while(*p) can easily be optimized out by the compiler. Add some barriers. Also, adding _mm_pause() for Intel chips can significantly improve performance. – Xyloid
@Bo: This needs volatile or _Atomic. The loop will optimize to if(*p) { while(1); }, i.e. an infinite loop if ever entered. – Seamark

The syntax is wrong. It works after a little modification.

spin_lock:
    movl $0, %eax
    movl $1, %ecx
    lock cmpxchg %ecx, (lock_addr)
    jnz spin_lock
    ret
spin_unlock:
    movl $0, (lock_addr)
    ret

To make the code run faster, assume lock_addr is stored in the %rdi register.

Use movl and test instead of lock cmpxchgl %ecx, (%rdi) to spin.

Use lock cmpxchgl %ecx, (%rdi) to try to enter the critical section only when there is a chance of success.

That avoids unneeded bus locking.

spin_lock:
    movl $1, %ecx
loop:
    movl (%rdi), %eax
    test %eax, %eax
    jnz loop
    lock cmpxchgl %ecx, (%rdi)
    jnz loop
    ret
spin_unlock:
    movl $0, (%rdi)
    ret

I tested it with pthreads, using a simple loop like this:

for(i = 0; i < 10000000; ++i){
    spin_lock(&mutex);
    ++count;
    spin_unlock(&mutex);
}
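
For context, here is a hedged sketch of the kind of pthread harness that test implies (the thread count and the extern prototypes for the asm routines are assumptions, not the answer's actual test code):

#include <pthread.h>
#include <stdio.h>

// assumed prototypes for the asm routines above (first argument arrives in %rdi per the SysV ABI)
extern void spin_lock(int *p);
extern void spin_unlock(int *p);

static int  mutex = 0;
static long count = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i) {
        spin_lock(&mutex);
        ++count;
        spin_unlock(&mutex);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    printf("count = %ld\n", count);   // 4 * 10000000 if the lock works
    return 0;
}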

In my test, the first version takes 2.5~3 seconds and the second takes 1.3~1.8 seconds.

Checkbook answered 23/12, 2019 at 12:41 Comment(1)
If you're going to tweak it for efficiency, also use pause to avoid a memory-order mis-speculation pipeline nuke on the iteration that leaves the read-only loop. (But you really want to avoid running pause on the no-contention fast path, so you'd have to re-arrange the branching.) E.g. like this answer: Locks around memory manipulation via inline assembly – Seamark
