Optimal way to pass a few variables between 2 threads pinning different CPUs
I have a problem and I need to understand whether there is a better solution. I have written the following code to pass a few variables from a writer thread to a reader thread. The threads are pinned to different CPUs that share the same L2 cache (hyperthreading is disabled).

writer_thread.h

struct a_few_vars {
    uint32_t x1;
    uint32_t x2;

    uint64_t x3;
    uint64_t x4;
} __attribute__((aligned(64)));

volatile uint32_t head;
struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));

reader_thread.h

uint32_t tail;
struct a_few_vars *p_xxx;

The writer thread increments the head variable, and the reader thread checks whether head and tail are equal. If they are not equal, it reads the new data as follows:

while (true) {
    if (tail != head) {
        .. process xxx[head] ..
        .. update tail ..
    }
}

Performance is by far the most important issue. I'm using Intel Xeon processors, and the reader thread fetches the head value and the xxx[head] data from memory each time. I used the aligned array to be lock-free.

In my case, is there any method to flush the variables to the reader CPU's cache as soon as possible? Can I trigger a prefetch for the reader CPU from the writer CPU? I can use special Intel instructions via __asm__ if they exist. In conclusion, what is the fastest way to pass the variables in the struct between threads pinned to different CPUs?

Thanks in advance

Teter answered 23/12, 2018 at 19:29 Comment(9)
volatile is insufficient to prevent a race condition. You'll need a mutex or you'll have to access the variables via primitives from stdatomic.hSackbut
@CraigEstey, that is not true because the writer thread first updates the xxx[head + 1], then updates head, so it is sufficient to prevent a race condition.Teter
Are you using an ancient Core 2 Xeon? Actually those didn't have HT, so it would have to be Pentium 4 Xeon for everything in your question to make sense. All newer Intel CPUs (Nehalem and newer) have private per-core L1 and L2, and only the last-level L3 cache is shared. This includes KNL (Xeon Phi).Herzog
@PeterCordes, you are right, I wrote the same L2 cache to emphasize that there is not any NUMA transation. The CPUs are in the same processor.Teter
Um, no ... Because you need something to guarantee that the cache on one CPU core is flushed and visible to the other core atomically (e.g. on x86, you need an mfence variant). So, once again, either a mutex or atomic operationSackbut
@CraigEstey: It's UB, but no you don't need mfence. You just need to stop compile-time reordering. The writer thread is write-only, so all you need is ordered writes. x86 asm does acquire / release semantics for free, and we don't need seq-cst here. (memory_order_release compiles without any extra barrier instructions on x86, just blocking compile-time reordering.) mfence doesn't make data visible to the other core any faster, it just stalls the current core's later loads until earlier stores are globally visible. Cache coherency doesn't take extra instructions, just ordering.Herzog
@PeterCordes Of course, I defer to your [vast] x86 arch knowledge. IMO, at a minimum, the code should include comments as to why, in the specific case, the more general solution isn't required as this is relying on something that may eventually break on a different model or arch (e.g. arm)Sackbut
Guys, please talk more clearly to ordinary people like me. So I should not need a memory barrier (mfence) for the problem, right?Teter
@CraigEstey: The Right Way is using C11 atomic_store_explicit(..., memory_order_release). On x86, that compiles to just a mov store. On ARM, that will compile to a store + dsb ish. On AArch64, that will compile to a stra or whatever it's called, a sequential-consistency release store but no barrier. You don't want to mess around with targeting the asm memory model from C using _mm_mfence() manually or stuff like that, because the whole point of C11 stdatomic is to let you write portable code and have the compiler do that for you. It's nice to know what's efficient, though.Herzog
It's undefined behaviour for one thread to write a volatile variable while another thread reads it, according to C11. volatile accesses are also not ordered with respect to other accesses. You want atomic_store_explicit(&head, new_value, memory_order_release) in the writer and atomic_load_explicit(&head, memory_order_acquire) in the reader to create acq/rel synchronization, and force the compiler to make the stores into your struct visible before the store to head which indicates to the reader that there's new data.
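As a minimal sketch of that release/acquire publish pattern (the ring size, `run_demo` helper, and thread bodies here are illustrative, not from the question):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <pthread.h>

struct a_few_vars { uint32_t x1, x2; uint64_t x3, x4; };

static struct a_few_vars ring[8];
static _Atomic uint32_t head_idx;   /* _Atomic replaces the question's volatile */

static void *writer(void *arg) {
    (void)arg;
    uint32_t h = atomic_load_explicit(&head_idx, memory_order_relaxed);
    ring[(h + 1) % 8].x3 = 42;                /* fill the slot first...       */
    atomic_store_explicit(&head_idx, h + 1,   /* ...then publish with release */
                          memory_order_release);
    return NULL;
}

/* Spawns the writer, waits for the publish, returns the value it saw. */
static uint64_t run_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    uint32_t h;
    /* The acquire load pairs with the writer's release store, so the
       struct contents are guaranteed visible once head_idx has changed. */
    while ((h = atomic_load_explicit(&head_idx, memory_order_acquire)) == 0)
        ;
    pthread_join(t, NULL);
    return ring[h % 8].x3;
}
```

The key property: everything stored before the `memory_order_release` store is visible to a thread whose `memory_order_acquire` load observes that store.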

(tail is private to the reader thread, so there's no mechanism for the writer to wait for the reader to have seen new data before writing more. So technically there's a possible race on the struct contents if the writer thread writes again while the reader is still reading. So the struct should also be _Atomic).


You might want a seq-lock where the writer updates a sequence number and the reader checks it before and after copying out the variables. https://en.wikipedia.org/wiki/Seqlock This lets you detect and retry in the rare cases where the writer was in the middle of an update when the reader copied the data.
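A possible C11 sketch of that seqlock idea (field names and the two-word payload are made up for illustration; the data fields are `_Atomic` with relaxed ops to stay within the C11 memory model):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

static _Atomic uint32_t seq;   /* even: stable, odd: write in progress */
static _Atomic uint64_t d0, d1;

/* Single writer: bump seq to odd, write the data, bump seq back to even. */
static void publish(uint64_t a, uint64_t b) {
    uint32_t s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* seq++ before the data    */
    atomic_store_explicit(&d0, a, memory_order_relaxed);
    atomic_store_explicit(&d1, b, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);
}

/* Reader: returns false if a write was in progress or the data tore;
   the caller just retries. */
static bool try_read(uint64_t *a, uint64_t *b) {
    uint32_t s0 = atomic_load_explicit(&seq, memory_order_acquire);
    if (s0 & 1)
        return false;
    *a = atomic_load_explicit(&d0, memory_order_relaxed);
    *b = atomic_load_explicit(&d1, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* data before seq re-check */
    return atomic_load_explicit(&seq, memory_order_relaxed) == s0;
}
```

The reader never writes the shared line, which is what makes this attractive for one writer publishing to read-only consumers.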

It's pretty good for write-only / read-only situations, especially if you don't need to worry about the reader missing an update.

See my attempt at a SeqLock in C++11: Implementing 64 bit atomic counter with 32 bit atomics and also how to implement a seqlock lock using c++11 atomic library

And GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? shows another example (this one causes a gcc bug).

Porting these from C++11 std::atomic to C11 stdatomic should be straightforward. Make sure to use atomic_store_explicit, because the default memory ordering for plain atomic_store is memory_order_seq_cst, which is slower.


Not much you can do will actually speed up the writer making its stores globally visible. A CPU core already commits stores from its store buffer to its L1d as quickly as possible (obeying the restrictions of the x86 memory model, which doesn't allow StoreStore reordering).

On a Xeon, see When CPU flush value in storebuffer to L1 Cache? for some info about different Snoop Modes and their effect on inter-core latency within a single socket.

The caches on multiple cores are coherent, using MESI to maintain coherency.

A reader spin-waiting on an atomic variable is probably the best you can do, using _mm_pause() inside the spin loop to avoid a memory-order mis-speculation pipeline clear when exiting the spin-loop.
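A hedged sketch of such a spin-wait (the `wait_for_update` helper and `cpu_relax` fallback are mine, not from the answer; `_mm_pause` is only available on x86):

```c
#include <stdatomic.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>              /* _mm_pause */
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)       /* no-op fallback on non-x86 */
#endif

/* Spin until the writer publishes a head value different from tail;
   returns the new head. */
static uint32_t wait_for_update(_Atomic uint32_t *head, uint32_t tail) {
    uint32_t h;
    while ((h = atomic_load_explicit(head, memory_order_acquire)) == tail)
        cpu_relax();   /* pause hint: cheaper spinning, faster exit from
                          the loop without a memory-order machine clear */
    return h;
}
```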

You also don't want to wake up in the middle of a write and have to retry. You might want to put the seq-lock counter in the same cache line as the data, so those stores can hopefully be merged in the store buffer of the writing core.

Herzog answered 23/12, 2018 at 20:12 Comment(6)
Personally, I've come to like a ticket lock to prevent starvation and promote fairness: en.wikipedia.org/wiki/Ticket_lock All the rest about limiting the spin probably applies. From the page, I think that spin on the now_serving can be done with simple fetch (i.e. non-atomic) as only the fetch_and_inc needs to be atomic. I've used this in production code for a shipping product.Sackbut
@CraigEstey: it has to be an atomic load, e.g. atomic_load_explicit(&now_serving, memory_order_acquire). It's not a read-modify-write, but it has to be atomic or else there could be tearing of the value. (And in C you need the compiler to assume that other threads can change the value.) Naturally-aligned pure loads/stores of integers/pointers are atomic for free in asm on most ISAs, but in C you must tell the compiler about it or else it's data-race undefined behaviour. e.g. Why is integer assignment on a naturally aligned variable atomic on x86?Herzog
That's still a lock where both the reader and writer are modifying the lock variable. That's probably good if you have multiple readers and multiple writers, but one write-only thread publishing to one or more read-only threads is a classic use-case for a seqlock or a lockless queue. (Single-writer / single-reader does simplify a lockless queue significantly).Herzog
I've always done the release with atomic, just to be safe [my impl. had multiple readers and writers]. The wiki is a [too] bit general [I only said the release might be done non-atomic because the wiki showed it that way]. To release the now_serving, the owner already knows the value to store (vs incrementing) if the code is restructured (e.g. ticketLock_acquire returns the my_ticket value and can store my_ticket + 1). If the value tears, it just delays the next acquirer a bit longer (i.e. atomic flush would be faster) but it would work [eventually]Sackbut
As to seqlock, I've just looked over the latest include/linux/seqlock.h and it seems a bit more heavyweight. But, it's not a pure lock, since the reader(s) have to retry if they detect that the seqlock value has changed: do { seqlock_acq ; do_stuff ; } while (seqlock_retry);. So, do_stuff is a bit restricted as what it can do. Suppose do_stuff needed the read value of what the lock was protecting, but did something irrevocable to another variable with that value.Sackbut
Hmm. now_serving is only modified while the lock is held, so it might actually be ok to make it non-atomic in C. But no, ticketLock_acquire spins on it changing, so C11 requires that it's an atomic variable for a reader to actually notice the change. The Wiki code is described as pseudocode, not a proper C11 implementation. The increment to now_serving could be done as a separate atomic load and store, avoiding an atomic RMW. e.g. *now_serving = 1 + atomic_load_explicit(now_serving, mo_relaxed);. (And the store part could be release, not seq-cst.)Herzog
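A C11 sketch along the lines of this comment thread (the type and function names are hypothetical, not from the Wikipedia page): the acquire uses an atomic RMW on next_ticket and an atomic acquire load while spinning, while the release is a plain release store because the owner already knows its ticket.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t next_ticket;
    _Atomic uint32_t now_serving;
} ticket_lock;

static uint32_t ticket_acquire(ticket_lock *l) {
    uint32_t me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* must be an atomic load so the compiler re-reads it every iteration */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;
    return me;
}

static void ticket_release(ticket_lock *l, uint32_t me) {
    /* the owner knows the value to store, so no atomic RMW is needed */
    atomic_store_explicit(&l->now_serving, me + 1, memory_order_release);
}
```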
