Locks around memory manipulation via inline assembly

I am new to the low-level stuff, so I am completely oblivious to the kinds of problems you might face down there, and I am not even sure I understand the term "atomic" correctly. Right now I am trying to make simple atomic locks around memory manipulation via extended assembly. Why? For the sake of curiosity. I know I am reinventing the wheel here and possibly oversimplifying the whole process.

The question? Does the code I present here achieve the goal of making memory manipulation both thread-safe and reentrant?

  • If it works, why?
  • If it doesn't work, why?
  • Not good enough? Should I for example make use of the register keyword in C?

What I simply want to do...

  • Before memory manipulation, lock.
  • After memory manipulation, unlock.

The code:

volatile int atomic_gate_memory = 0;

static inline void atomic_open(volatile int *gate)
{
    asm volatile (
        "wait:\n"
        "cmp %[lock], %[gate]\n"
        "je wait\n"
        "mov %[lock], %[gate]\n"
        : [gate] "=m" (*gate)
        : [lock] "r" (1)
    );
}

static inline void atomic_close(volatile int *gate)
{
    asm volatile (
        "mov %[lock], %[gate]\n"
        : [gate] "=m" (*gate)
        : [lock] "r" (0)
    );
}

Then something like:

void *_malloc(size_t size)
{
        atomic_open(&atomic_gate_memory);
        void *mem = malloc(size);
        atomic_close(&atomic_gate_memory);
        return mem;
}
#define malloc(size) _malloc(size)

... same for calloc, realloc, free, and fork (on Linux).

#ifdef _UNISTD_H
int _fork()
{
        pid_t pid;
        atomic_open(&atomic_gate_memory);
        pid = fork();
        atomic_close(&atomic_gate_memory);
        return pid;
}
#define fork() _fork()
#endif

After loading the stackframe for atomic_open, objdump generates:

00000000004009a7 <wait>:
4009a7: 39 10                   cmp    %edx,(%rax)
4009a9: 74 fc                   je     4009a7 <wait>
4009ab: 89 10                   mov    %edx,(%rax)

Also, given the disassembly above, can I assume I am making an atomic operation because it is only one instruction?

Vermis answered 15/5, 2016 at 17:37 Comment(14)
No, it's not thread-safe, because two threads could simultaneously run the cmp and assume they can take the lock.Auric
@Auric Oh, snap... I had somehow assumed that the CPU only executed one set of instructions at a time, interleaving with different sets of instructions when it is multi-core... This really complicates things...Vermis
Interleaving (multitasking) also causes the same problem. After one thread does the cmp, the next thread might get the CPU and also do its cmp.Auric
This means there is no way to use a global lock for memory? If that is the case, then you should make that an answer. I will accept it and give a thumbs up. In the meantime, I am going to have to rethink my strategy...Vermis
Obviously there are ways. The recommended one is not to use assembly, but if you do want to use it, you should utilize atomic instructions such as lock cmpxchg.Auric
Well, it was not obvious to me, but I am getting the picture in my head. I guess I still have a lot more reading to do... If you want to make this your answer, I will of course accept it. Thanks a lot! I really appreciate it :)Vermis
Why would you need the locks around fork? You shouldn't need them for malloc either, because the malloc implementation on Linux is thread-safe, and so already has the necessary locks. The fork system call doesn't change any state in the process that's visible to other threads, so there's nothing to protect with a lock.Essive
I was simply curious about locks and wanted practice. As for malloc, there are many implementations on different platforms, so I thought it would be nice to be able to implement locks. It is also a good skill to have for future platforms. As for fork, I was worried about deadlocks because I have assumed some threads are going to be dead, which will mean some states will never be released. Am I wrong?Vermis
The locks you put around fork only affect threads that call your fork wrapper, and only during the time they call that wrapper. They have no effect on other threads and any state they may be responsible for. The state of the parent process is unchanged by the fork call, but the child process has only one thread, a clone of the thread that called fork. Because you don't know when they were interrupted by the fork call, state owned by other threads would effectively be in an undefined state in the child process, but your locks around fork don't change this.Essive
My attempt was faulty and Jester has already established that. Ignoring the fact that my approach doesn't work, I did have reasons for using locks around fork. One of the reasons is exactly the one you pointed out. Only one thread is going to live in the child process. I know the state is the same. The focal point is the fact that other threads in the child process are dead. Last time I checked, any action on the heap is non-reentrant. This means I have to make sure they are not running when fork is executed, and vice versa, so the state is not going to be left undefined. Does this make sense?Vermis
Though since you're using the same lock for both the fork and malloc wrappers, it would in theory let you call malloc safely in the child process if you reinitialize the lock in the child process, since you know that the fork didn't interrupt a malloc call in another thread. The problem is that in practice it's not only your own code that can call malloc; all sorts of other library functions can call malloc, and those calls won't be protected by your wrapper. Fortunately all this is unnecessary in practice, since Linux's glibc uses pthread_atfork to lock itself across forks.Essive
Ahh, I now see your point. Aside from GPU libraries, I am goofing around writing my own libraries, so I was stuck in that box. Damn... This changes things... Especially if these GPU libraries have their own wrappers... I'm going to bang my head against the wall now. Have a nice day sir, and have my thumbs up.Vermis
See Jeff Preshing's blog to learn about memory ordering semantics. You can learn from compiler output, too: Write some simple functions with C++11 std::atomic, and look at the asm output (e.g. on gcc.godbolt.org). Also, I answered a recent question about rolling your own semaphore counting-locks, with C11 atomics. I talked some about how it compiles into x86 code.Blanchette
See also https://mcmap.net/q/18509/-spinlock-with-xchg-unlocking. But that implementation has bad performance: it spins on lock xchg, instead of on a normal load, and doesn't use the pause instruction.Blanchette

I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like this. Of course a real mutex implementation would use a system call (like Linux futex) after spinning for a while, and unlocking would have to check if it needs to notify any waiters with another system call. This is important; you don't want to spin forever wasting CPU time (and energy / heat) doing nothing. But conceptually this is the spin part of a mutex before you take the fallback path. It's an important piece of how light-weight locking is implemented. (Only attempting to take the lock once before calling the kernel would be a valid choice, instead of spinning at all.)

Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation. This is NASM syntax. If using GNU C inline asm, make sure you use a "memory" clobber to stop compile-time reordering of memory access. But don't use inline asm; use C _Atomic uint8_t or C++ std::atomic<uint8_t> with .exchange(1, std::memory_order_acquire) and .store(0, std::memory_order_release), and _mm_pause() from immintrin.h.

;;; UNTESTED ;;;;;;;;
;;; TODO: **IMPORTANT** fall back to OS-supported sleep/wakeup after spinning some
;;; e.g. Linux futex
    ; first arg in rdi as per AMD64 SysV ABI (Linux / Mac / etc)

;;;;;void spin_unlock(volatile char *lock)
global spin_unlock
spin_unlock:
       ; movzx  eax, byte [rdi]  ; debug check for double-unlocking.  Expect 1
    mov   byte [rdi], 0        ; lock.store(0, std::memory_order_release)
    ret

align 16
;;;;;void spin_lock  (volatile char *lock)
global spin_lock
spin_lock:
    mov   eax, 1                 ; only need to do this the first time, otherwise we know al is non-zero
.retry:
    xchg  al, [rdi]

    test  al,al                  ; check if we actually got the lock
    jnz   .spinloop
    ret                          ; no taken branches on the fast-path

align 8
.spinloop:                    ; do {
    pause
    cmp   byte [rdi], al      ; C++11: lock.load(std::memory_order_acquire)
    jne   .retry              ; retry the xchg once the lock no longer reads as 1
    jmp   .spinloop

; if not translating this to inline asm, you could put the spin loop *before* the function entry point, saving the last jmp
; but since this is probably too simplistic for real use, I'm going to leave it as-is.
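
For comparison, a minimal sketch of the portable stdatomic version recommended above might look like this in C11 (illustrative only: no futex fallback, and the contended path is just a bare retry; the pause / read-only-spin refinement is covered further down):

#include <stdatomic.h>

typedef _Atomic unsigned char spinlock_t;   // one byte is plenty for a lock

static inline void spin_lock(spinlock_t *lock)
{
    // xchg-with-1: keep retrying until the old value was 0, i.e. we took the lock.
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
        ;   // bare retry; a better contended path is sketched further down
}

static inline void spin_unlock(spinlock_t *lock)
{
    // Plain release store; no locked instruction is needed to unlock.
    atomic_store_explicit(lock, 0, memory_order_release);
}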

A plain store has release semantics, but not sequential-consistency (which you'd get from an xchg or something). Acquire/release is enough to protect a critical section (hence the name).


If you were using a bitfield of atomic flags, you could use lock bts (test and set) for the equivalent of xchg-with-1. You can spin on bt or test. To unlock, you'd need lock btr, not just btr, because it would be a non-atomic read-modify-write of the byte, or even the containing 32-bits.

With a byte- or int-sized lock, like you should normally use, you don't even need a locked operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does the same as my unlock function: a simple store.

(lock bts is not necessary; xchg or lock cmpxchg are just as good for a normal lock.)
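
If you did want one lock per bit, a rough C11 sketch of that idea could look like the following (illustrative only, with made-up names; whether the compiler actually emits lock bts / lock btr or some other locked RMW for these fetch-or / fetch-and calls is up to it):

#include <stdatomic.h>

// One lock per bit of an atomic word.
static inline void bit_lock(_Atomic unsigned *flags, unsigned bit)
{
    // Atomic test-and-set of one bit: the returned old value tells us
    // whether that bit was already set (lock already held).
    while (atomic_fetch_or_explicit(flags, 1u << bit, memory_order_acquire)
           & (1u << bit))
        ;   // spin; a real lock would pause and eventually sleep

}

static inline void bit_unlock(_Atomic unsigned *flags, unsigned bit)
{
    // Clearing the bit must also be an atomic RMW: a plain read-modify-write
    // of the containing word could clobber neighbouring bits (other locks).
    atomic_fetch_and_explicit(flags, ~(1u << bit), memory_order_release);
}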


The first access should be an atomic RMW

See discussion on Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - if the first access is read-only, the CPU might send out just a share request for that cache line. Then, if it sees the line unlocked (the hopefully-common low-contention case) it would have to send out an RFO (Read For Ownership) to actually be able to write the cache line. So that's twice as many off-core transactions.

The downside is that this will take MESI exclusive ownership of that cache line, but what really matters is that the thread owning the lock can efficiently store a 0 so we can see it unlocked. Either way, read-only or RMW, that core will lose exclusive ownership of the line and have to RFO before it can commit that unlocking store.

I think a read-only first access would just optimize for slightly less traffic between cores when multiple threads queue up to wait for a lock that's already taken. That would be a silly thing to optimize for.

(Fastest inline-assembly spinlock also tested the idea for a massively contended spinlock with multiple threads doing nothing but trying to take the lock, with poor results. That linked answer makes some incorrect claims about xchg globally locking a bus - aligned locks don't do that, just a cache lock (Is incrementing an int effectively atomic in specific cases?), and each core can be doing a separate atomic RMW on a different cache line at the same time.)


However, if that initial attempt finds the lock already taken, we don't want to keep hammering on the cache line with atomic RMWs. That's when we fall back to read-only spinning. 10 threads all spamming xchg for the same spinlock would keep the memory arbitration hardware pretty busy. It would likely delay the visibility of the store that unlocks (because that thread has to contend for exclusive ownership of the line), so it's directly counter-productive. It may also slow down memory in general for other cores.

PAUSE is also essential, to avoid mis-speculation about memory ordering by the CPU. You exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, PAUSE waits a lot longer, like ~100 cycles up from ~5, so you should definitely keep the spin-loop separate from the initial check for unlocked.
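
In the C11 sketch from earlier, that structure might look roughly like this (again just an illustration with a made-up name, and a real lock would still fall back to the OS after spinning for a while):

#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause(), x86 only

static inline void spin_lock_v2(_Atomic unsigned char *lock)
{
    // First access is an atomic RMW; no pause on the uncontended fast path.
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        // Contended: spin with a read-only load plus pause, then retry the xchg.
        // The acquire on the successful exchange is what orders the critical section.
        do {
            _mm_pause();
        } while (atomic_load_explicit(lock, memory_order_relaxed));
    }
}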

I'm sure Intel's and AMD's optimization manuals talk about this, see the tag wiki for that and tons of other links.


Not good enough? Should I for example make use of the register keyword in C?

register is a meaningless hint in modern optimizing compilers, except in debug builds (gcc -O0).

Blanchette answered 16/5, 2016 at 3:25 Comment(4)
I know I am not allowed to say this, but thanks for the insight! A question to justify this comment: Would the volatile keyword force the compiler to use the register keyword? Even if that is the case, would there be any benefit/point to using the register keyword for locks to begin with?Vermis
volatile is kind of the opposite of register. It means the value must be re-read from memory every time it's referenced, and every store must be done separately and in program order. What are you hoping that register will do in the generated asm, anyway? It doesn't make any sense for a locking implementation, even if it did anything.Blanchette
It is just a lack of understanding on my part. I was unsure how locks are made to begin with. From an amateur perspective, it seems it might make sense, as locks are kinda special given their nature. They have to be fast and make use of atomic operations so we can have thread safety and reentrancy. For an amateur like me, reserving a register for locks seemed like it might make sense; but I wasn't quite certain about that part either. I was thinking about something along the lines of "int register res asm("r0")=0;"Vermis
@user1235831: 1. each thread has its own architectural state (including registers), so a register variable for a lock can't work. Read Jeff Preshing's blog posts about memory ordering, they're great. 2. slowing down the rest of the code by tying up a register permanently is a terrible idea. Most code doesn't spend much of its time on locking. 3. Locking is inherently expensive. You're more likely to get a benefit from avoiding locking as much as possible through careful design.Blanchette
