Atomic swap in C

I think I'm missing something obvious here. I have code like this:

int *a = <some_address>;
int *b = <another_address>;

// ...

// desired atomic swap of addresses a and b here
int *tmp = a;
a = b; b = tmp;

I want to do this atomically. I've been looking at __c11_atomic_exchange, and on OS X the OSAtomic functions, but nothing seems to perform a straight swap atomically. They can all write a value into one of the two variables atomically, but never both.

Any ideas?

Heid answered 18/1, 2014 at 3:58 Comment(3)
The thing that you are missing is the fact that there are no operations that interact atomically with two memory locations simultaneously. If you want to atomically swap two objects, you'll need to protect them with a mutex of some kind (see the sketch after these comments).Moffat
Thanks Casey, that was a duh! moment...Heid
Only a few machines have ever had the ability to atomically operate on two disjoint locations, e.g. 68020 through 68040 had a CAS2 en.wikipedia.org/wiki/Double_compare-and-swap instruction. And if you have transactional memory like Intel RTM, then it's something you can do as a transaction. But it's so rare that ISO C and C++ don't expose such an operation, otherwise atomic_int would have to be non-lock-free on almost all architectures. (Because it can't be synthesized out of other things without locking.)Orphanage
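
A minimal sketch of the mutex approach suggested in the comments, using C11's <threads.h> (the names swap_mtx and swap_ptrs_locked are illustrative; a pthread_mutex_t works the same way if <threads.h> is unavailable):

#include <threads.h>

// One mutex guarding both pointers; initialize it once at startup with:
//   mtx_init(&swap_mtx, mtx_plain);
mtx_t swap_mtx;

void swap_ptrs_locked(int **pa, int **pb)
{
    mtx_lock(&swap_mtx);  // every reader/writer of these pointers must lock
    int *tmp = *pa;
    *pa = *pb;
    *pb = tmp;
    mtx_unlock(&swap_mtx);
}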

It depends a bit on your architecture what is effectively and efficiently possible. In any case, you would need the two pointers that you want to swap to be adjacent in memory; the easiest way is to have them in a struct, something like

typedef struct pair { void *a[2]; } pair;

Then, if you have a full C11 compiler, you can use _Atomic(pair) as an atomic type to swap the contents of such a pair: first load the actual pair into a temporary, construct a new one with the two values swapped, then compare-exchange to store the new value, and iterate if necessary:

#include <stdatomic.h>

inline
void pair_swap(_Atomic(pair) *myPair) {
  pair actual = { 0 };
  pair future = { 0 };

  // The first iteration can only succeed if the pair is all null, in which
  // case storing `future` (also all null) is already the correct "swap".
  // On failure, the compare-exchange reloads the current value into
  // `actual`; we then build the swapped pair and retry until the store
  // succeeds against an unchanged value.
  while (!atomic_compare_exchange_weak(myPair, &actual, future)) {
      future.a[0] = actual.a[1];
      future.a[1] = actual.a[0];
  }
}

Depending on your architecture, this might or might not be realized with a lock-free primitive operation. Modern 64-bit architectures usually have 128-bit atomics, so this can compile down to just a few 128-bit assembler instructions.

But this depends a lot on the quality of the implementation of atomics on your platform. P99 provides an implementation of the "function-like" part of the atomic operations and supports 128-bit swap where it is detected; on my machine (64-bit Intel Linux) this effectively compiles into a lock cmpxchg16b instruction.
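
A minimal usage sketch of pair_swap, assuming the pair type and the function above (the variables x and y are just illustrative swap targets):

#include <stdatomic.h>
#include <stdio.h>

typedef struct pair { void *a[2]; } pair;

void pair_swap(_Atomic(pair) *myPair);  // as defined above

int main(void) {
    int x = 1, y = 2;
    _Atomic(pair) P;
    atomic_init(&P, (pair){ .a = { &x, &y } });

    pair_swap(&P);  // atomically swap the two stored pointers

    pair now = atomic_load(&P);
    printf("swapped: %d\n", now.a[0] == &y && now.a[1] == &x);  // prints 1
}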

Geter answered 18/1, 2014 at 9:38 Comment(1)
For people wanting to learn more about the types and functions you provided, here is some useful documentation: en.cppreference.com/w/c/language/atomic, en.cppreference.com/w/c/thread, en.cppreference.com/w/c/atomic/atomic_exchange, en.cppreference.com/w/c/atomic/atomic_compare_exchange.Quinquagesima

I think this is worth presenting up front: the spin lock approach I'm about to present is valid and correct, but it is also scheduler-unaware, meaning that it won't cooperate with the scheduler the way a standard mutex does; instead, it fights to make progress during whatever time slices the scheduler happens to grant it (which appear random from its perspective). So, it has some tradeoffs.

Here are some comments I've written below this answer regarding this:

I read the whole article by Linus that you linked to. He's criticizing a very specific benchmark spin lock code which is attempting to characterize the Linux scheduler by using spin locks on a timer. Linus is right though: the OS scheduler benchmark routine being described is broken because there is an unsynchronized race condition between the scheduler and that spin lock. That is unrelated to what I'm doing in my spin lock above. I understand the scheduler limitations and potential conflicts...

...with my spin lock implementation. Furthermore, I'm not writing the low-level locking primitives from scratch, which is the part people will almost certainly get wrong. Rather, my C++ compiler (g++) is. That critical section of code is written in the <stdatomic.h> header as atomic_test_and_set and the type atomic_bool which I use too. So, as long as the spin lock limitations are understood, what I have done above is a perfectly legitimate way to approach a problem like this. It should be benchmarked and compared to other approaches to know if it has merit over a standard C++ mutex...

...C mutex, or pthread mutex [in a given context]. For some of the modern C++ mutexes, see my article on my personal website here: https://gabrielstaples.com/cpp-mutexes-and-locks. Additionally, my answer here isn't meant to present the best approach. It's meant to present a valid approach. The user needs to understand the limitations. It may be used in other contexts too, such as with the Linux soft-real-time SCHED_RR round-robin scheduler, or in a kernel module, or elsewhere. There are certainly tradeoffs, but that doesn't mean alternative approaches like this should not be presented.

Now, if you want to see an interesting spin lock approach, here you go:

Quick summary: spin-lock-based atomic swap answer

This code snippet is what we will be building up to.

For a basic spin lock (a high-speed, no-context-switch mutex) approach on a multi-core system, which is where spin locks are most useful (they need multi-core, not merely multi-threaded, systems to be functional), you can do an atomic swap of variable values (pointers in this case) like this:

int *a = &some_int1;  // ptr a
int *b = &some_int2;  // ptr b
spinlock_t spinlock = SPINLOCK_UNLOCKED;

// swap pointers `a` and `b` (note: swapping the pointers themselves, not the
// integer values they point to)
atomic_swap_int_ptr(&spinlock, &a, &b);

Important clarification by @Peter Cordes:

...out of any systems where you might consider using spinlocks, they only make any sense on multi-core systems, and it's better with pre-emptive multi-tasking.

Here are the details:

General-purpose spin lock approach in C11 (to wrap any "critical section" that needs to be atomic)

If your architecture supports lock-free atomic swaps of the array "pair" presented in the main answer here, then consider that approach first: being lock-free, it can be much more efficient than using a mutex or spin lock.
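
Side note: if you want to check whether the pair approach is actually lock-free on your platform, C11's atomic_is_lock_free() can tell you at runtime. A minimal sketch, assuming the same pair struct as in the main answer:

#include <stdatomic.h>
#include <stdio.h>

typedef struct pair { void *a[2]; } pair;

int main(void) {
    _Atomic(pair) P;
    atomic_init(&P, (pair){ 0 });
    // True if operations on this object compile to lock-free instructions
    // (e.g., a 128-bit compare-exchange) rather than using a hidden lock.
    printf("pair is lock-free: %d\n", (int)atomic_is_lock_free(&P));
    return 0;
}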

I wanted to dig into this a bit deeper, though, and try my hand at a spin lock. I have not speed-profiled the various techniques (array "pair", mutex, and spin lock), but doing so would be an interesting exercise to see how they compare.

Anyway, as I was saying: you could also use a mutex.

Or, if you don't want your thread to go to sleep when the mutex isn't immediately available, you can improve efficiency by using a spin lock: the number of times a spin lock has to spin is frequently so small that spinning is faster than putting the thread to sleep and paying for a whole context switch.

C11 has a whole suite of atomic types and functions which allow you to easily implement your own spin lock to accomplish this.

Keep in mind, however, that a basic spin lock must not be used on a bare-metal (no operating system), single-core (single-processor) system, as it will result in deadlock if the main context takes the lock and then, while the lock is still held, an ISR fires and tries to take it: the ISR will block indefinitely.

To avoid this, more-complicated locking mechanisms must be used, such as a spin lock with a timeout (see the bounded-spin sketch after the basic implementation below), an ISR which auto-reschedules itself for a slightly later time if the lock is taken, automatic yielding, etc.

So, spin locks are best reserved for multi-core systems with an OS and scheduler.

Basic spin lock implementation in C11 using _Atomic types

#include <stdbool.h>
#include <stdatomic.h>


#define SPINLOCK_UNLOCKED 0
#define SPINLOCK_LOCKED   1

typedef volatile atomic_bool spinlock_t;

// This is how you'd create a new spinlock to use in the functions below.
// spinlock_t spinlock = SPINLOCK_UNLOCKED;

/// \brief  Set the spin lock to `SPINLOCK_LOCKED`, and return the previous 
///  state of the `spinlock` object.
/// \details    If the previous state was `SPINLOCK_LOCKED`, you can know that
///  the lock was already taken, and you do NOT now own the lock. If the
///  previous state was `SPINLOCK_UNLOCKED`, however, then you have now locked
///  the lock and own the lock.
bool atomic_test_and_set(spinlock_t* spinlock)
{
    bool spinlock_val_old = atomic_exchange(spinlock, SPINLOCK_LOCKED);
    return spinlock_val_old;
}

/// Spin (loop) until you take the spin lock.
void spinlock_lock(spinlock_t* spinlock)
{
    // Spin lock: loop forever until we get the lock; we know the lock was
    // successfully obtained after exiting this while loop because the
    // atomic_test_and_set() function locks the lock and returns the previous
    // lock value. If the previous lock value was SPINLOCK_LOCKED then the lock
    // was **already** locked by another thread or process. Once the previous
    // lock value was SPINLOCK_UNLOCKED, however, then it indicates the lock
    // was **not** locked before we locked it, but now it **is** locked because
    // we locked it, indicating we own the lock.
    while (atomic_test_and_set(spinlock) == SPINLOCK_LOCKED) {};
}

/// Release the spin lock.
void spinlock_unlock(spinlock_t* spinlock)
{
    *spinlock = SPINLOCK_UNLOCKED;
}
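
As mentioned above, bare-metal or single-core use would need something like a timeout. Here is a hedged sketch of a bounded-spin variant (the function name and max_spins parameter are my own, building on the spinlock_t and atomic_test_and_set() above):

#include <stdint.h>

/// Try to take the lock, giving up after `max_spins` failed attempts instead
/// of spinning forever. Returns true if the lock was obtained. On false, the
/// caller must NOT enter the critical section.
bool spinlock_lock_bounded(spinlock_t* spinlock, uint32_t max_spins)
{
    for (uint32_t i = 0; i < max_spins; i++)
    {
        if (atomic_test_and_set(spinlock) == SPINLOCK_UNLOCKED)
        {
            return true;  // previous state was unlocked, so we now own it
        }
    }
    return false;  // timed out; lock not obtained
}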

Now, you can use the spin lock, for your example, in your multi-core system, like this:

/// Atomically swap two pointers to int (`int *`). 
/// NB: this does NOT swap their contents, meaning: the integers they point to.
/// Rather, it swaps the pointers (addresses) themselves.
void atomic_swap_int_ptr(spinlock_t * spinlock, int ** ptr1, int ** ptr2)
{
    spinlock_lock(spinlock);
    // critical, atomic section start

    int * temp = *ptr1;
    *ptr1 = *ptr2;
    *ptr2 = temp;
    
    // critical, atomic section end
    spinlock_unlock(spinlock);
}


// Demo to perform an address swap

int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;

// swap pointers `a` and `b` (note: swapping the pointers themselves, not the
// integer values they point to)
atomic_swap_int_ptr(&spinlock, &a, &b);  // <=========== final answer

References

  1. https://en.wikipedia.org/wiki/Test-and-set#Pseudo-C_implementation_of_a_spin_lock
  2. ***** C11 Concurrency support library and atomic_bool and other types: https://en.cppreference.com/w/c/thread
  3. C11 atomic_exchange() function: https://en.cppreference.com/w/c/atomic/atomic_exchange

Other links

  1. Note that you could use C11's atomic_flag type as well, with its corresponding atomic_flag_test_and_set() function, instead of creating your own atomic_test_and_set() function using atomic_exchange(), like I did above; see the sketch after this list.
  2. _Atomic types in C11: https://en.cppreference.com/w/c/language/atomic
  3. C++11 Implementation of Spinlock using header <atomic>
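
Here is a hedged sketch of that atomic_flag alternative (the function and type names are mine; the usage pattern is the same as for spinlock_t above):

#include <stdatomic.h>

typedef atomic_flag spinlock_flag_t;
// Create one like this; it must be initialized with ATOMIC_FLAG_INIT:
// spinlock_flag_t lock = ATOMIC_FLAG_INIT;

/// Spin until the flag was previously clear, at which point we own the lock.
void spinlock_flag_lock(spinlock_flag_t* lock)
{
    // atomic_flag_test_and_set() sets the flag and returns its previous
    // state, so looping while it returns true waits until we take the lock.
    while (atomic_flag_test_and_set(lock)) {}
}

/// Release the lock by clearing the flag.
void spinlock_flag_unlock(spinlock_flag_t* lock)
{
    atomic_flag_clear(lock);
}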

TODO (newest on top)

  1. [ ] 20230418: Upgrade this implementation and speed profile it with the following changes. See C concurrency documentation here: https://en.cppreference.com/w/c/thread:
    1. [ ] Use atomic_flag instead of atomic_bool: https://en.cppreference.com/w/c/atomic/atomic_flag (emphasis added):

      atomic_flag is an atomic boolean type. Unlike other atomic types, it is guaranteed to be lock-free. Unlike atomic_bool, atomic_flag does not provide load or store operations.

      1. Be sure to initialize it with ATOMIC_FLAG_INIT.
      2. Also, use atomic_flag_test_and_set() instead of my custom atomic_test_and_set() above, and use atomic_flag_clear() to clear or "unset" the flag.
  2. [ ] 20220922: This implementation would need to be modified to add safety anti-deadlock mechanisms such as automatic deferral, timeout, and retry, to prevent deadlock on single core and bare-metal systems. See my notes here: Add basic mutex (lock) and spin lock implementations in C11, C++11, AVR, and STM32 and here: What are the various ways to disable and re-enable interrupts in STM32 microcontrollers in order to implement atomic access guards?
Quinquagesima answered 22/9, 2022 at 18:15 Comment(12)
spinlocks only need acquire / release semantics, not the default seq_cst; see the sketch after these comments. BTW, your phrasing might be misread as saying that on multi-core systems, spinlocks are the best kind of locking. Rather than what I think you mean: that out of any systems where you might consider using spinlocks, they only make any sense on multi-core systems, and it's better with pre-emptive multi-tasking. Agreed that spinlocks are a reasonable choice here, though.Orphanage
If you wanted to use separate locks for each of the two objects being swapped (or a hash table of locks), that could allow more parallelism between different threads swapping different pairs; see the sketch after these comments. To avoid deadlocks, acquire the locks in order of increasing address, so that if two operations both need to lock the same pair of objects, you avoid the situation where each one locks one lock and waits for the other. (If you use a hash table of locks, you need to check whether they both hash to the same entry and, if so, only lock that one.) But that's more locking operations, which may be slower if a single lock isn't badly contended.Orphanage
IIRC, some compilers use spinlocks instead of full mutexes for non-lock-free std::atomicOrphanage
atomic_bool is also lock-free on real-world implementations. The fact that atomic_flag is required to be lock-free isn't a real advantage in practice. The difference is in the available operations. (Especially in C++ before C++20 added .test() to read the value without modifying it! Before that, you could only write or RMW an atomic_flag! en.cppreference.com/w/cpp/atomic/atomic_flag . C still has that nasty limitation, only providing TAS and clear on atomic_flag, making it impossible to spin read-only until a retry might work. en.cppreference.com/w/c/thread)Orphanage
@PeterCordes, so would you recommend I leave my implementation as-is, or make my "todo" changes to use atomic_flag with atomic_flag_set() and atomic_flag_clear()?Quinquagesima
You'll probably get the same asm either way from GCC for any(?) architecture. For x86 at least, it uses xchg to implement atomic_flag_test_and_set_explicit. (And for .clear() or atomic_bool assignment, if you don't use memory_order_release to let it just use a cheaper mov.) Maybe for some ISAs, it would use a special test-and-set instruction if one existed but an atomic exchange was more cumbersome. Might be interesting to play with on godbolt.org if it has some historical ISAs like m68k.Orphanage
Spinlocks should never be used in user-space. realworldtech.com/forum/?threadid=189711&curpostid=189723Serious
@RafaelGago, I read the whole article by Linus that you linked to. He's criticizing a very specific benchmark spin lock code which is attempting to characterize the Linux scheduler by using spin locks on a timer. Linus is right though: the OS scheduler benchmark routine being described is broken because there is an unsynchronized race condition between the scheduler and that spin lock. That is unrelated to what I'm doing in my spin lock above. I understand the scheduler limitations and potential conflicts...Quinquagesima
...with my spin lock implementation. Furthermore, I'm not writing the low-level locking primitives from scratch, which is the part people will almost certainly get wrong. Rather, my C++ compiler (g++) is. That critical section of code is written in the <stdatomic.h> header as atomic_test_and_set and the type atomic_bool which I use too. So, as long as the spin lock limitations are understood, what I have done above is a perfectly legitimate way to approach a problem like this. It should be benchmarked and compared to other approaches to know if it has merit over a standard C++ mutex...Quinquagesima
...C mutex, or pthread mutex. For some of the modern C++ mutexes, see my article on my personal website here: gabrielstaples.com/cpp-mutexes-and-locks. Additionally, my answer here isn't meant to present the best approach. It's meant to present a valid approach. The user needs to understand the limitations. It may be used in other contexts too, such as with the Linux soft-real-time SCHED_RR round-robin scheduler, or in a kernel module, or elsewhere. There are certainly tradeoffs, but that doesn't mean alternative approaches like this should not be presented.Quinquagesima
Related: Performance pthread_spinlock_t is 2x better than my own implementation with lock free std::atomic_flag, around std::list shows a hand-rolled spinlock with C++ std::atomic_flag, and performance of a bad and probably misleading micro-benchmark of it on Skylake, for a few variations (with/without x86 pause, and with a read-only check before attempting another RMW).Orphanage
@GabrielStaples Probably I should have explained better what I meant when linking that valuable Linus chat. My points are two: 1) there is typically no way to atomically swap two different addresses, which is what the OP asked for; locking is required. 2) the sane default for short-lived locking in user space is a mutex, not a spinlock, no matter how correctly the spinlock is implemented. If you want a less naive user-space spinlock: rigtorp.se/spinlockSerious
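
For concreteness, here is a hedged sketch of the acquire/release variant suggested in the comments above (the function names are mine; this builds on the spinlock_t from the answer):

#include <stdatomic.h>

// Lock with acquire semantics and unlock with release semantics, rather than
// the default seq_cst ordering used by atomic_exchange() and plain stores.
void spinlock_lock_acq(spinlock_t* spinlock)
{
    while (atomic_exchange_explicit(spinlock, SPINLOCK_LOCKED,
                                    memory_order_acquire) == SPINLOCK_LOCKED) {}
}

void spinlock_unlock_rel(spinlock_t* spinlock)
{
    atomic_store_explicit(spinlock, SPINLOCK_UNLOCKED, memory_order_release);
}

And a sketch of the per-object, address-ordered locking idea from the comments (again, the names are mine):

#include <stdint.h>

// Swap two pointers, each protected by its own lock. Acquiring the locks in
// increasing address order prevents the deadlock where two threads swapping
// the same pair each hold one lock and wait forever for the other.
void atomic_swap_int_ptr_2locks(spinlock_t* lock1, spinlock_t* lock2,
                                int** ptr1, int** ptr2)
{
    spinlock_t* lo = ((uintptr_t)lock1 < (uintptr_t)lock2) ? lock1 : lock2;
    spinlock_t* hi = (lo == lock1) ? lock2 : lock1;

    spinlock_lock(lo);
    if (hi != lo) { spinlock_lock(hi); }  // guard against lock1 == lock2

    int* temp = *ptr1;
    *ptr1 = *ptr2;
    *ptr2 = temp;

    if (hi != lo) { spinlock_unlock(hi); }
    spinlock_unlock(lo);
}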
