I have a problem and I need to understand whether there is a better solution. I have written the following code to pass a few variables from a writer thread to a reader thread. The threads are pinned to different CPUs that share the same L2 cache (hyperthreading is disabled).
writer_thread.h
struct a_few_vars {
uint32_t x1;
uint32_t x2;
uint64_t x3;
uint64_t x4;
} __attribute__((aligned(64)));
volatile uint32_t head;
struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));
reader_thread.h
uint32_t tail;
struct a_few_vars *p_xxx;
The writer thread increments the head variable, and the reader thread checks whether head and tail are equal. If they are not equal, it reads the new data as follows:
while (true) {
    if (tail != head) {
        /* .. process xxx[head] .. */
        /* .. update tail .. */
    }
}
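For context, the writer side is not shown in the question; a minimal sketch of what it presumably does, using the declarations from writer_thread.h, might look like this. (write_entry is a hypothetical name, and the volatile head is kept exactly as declared even though, as the comments below point out, volatile alone is not a correct synchronization primitive.)

```c
#include <stdint.h>

struct a_few_vars {
    uint32_t x1;
    uint32_t x2;
    uint64_t x3;
    uint64_t x4;
} __attribute__((aligned(64)));

volatile uint32_t head;
struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));

/* Hypothetical writer step: fill the next slot, then advance head so the
   reader's tail != head check fires.  NOTE: without atomics/barriers the
   compiler and (on non-x86) the CPU may reorder these stores. */
void write_entry(uint32_t v1, uint32_t v2, uint64_t v3, uint64_t v4)
{
    uint32_t next = (head + 1) % UINT16_MAX;  /* wrap around the array */
    xxx[next].x1 = v1;
    xxx[next].x2 = v2;
    xxx[next].x3 = v3;
    xxx[next].x4 = v4;
    head = next;  /* "publish": reader then processes xxx[head] */
}
```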
Performance is by far the most important issue. I'm using Intel Xeon processors, and the reader thread fetches the head value and the xxx[head] data from memory each time. I used the aligned array so the exchange can stay lock-free.
In my case, is there any method to flush the variables to the reader CPU's cache as soon as possible? Can I trigger a prefetch for the reader CPU from the writer CPU? I can use special Intel instructions via __asm__ if they exist. In short, what is the fastest way to pass the variables in the struct between threads pinned to different CPUs?
Thanks in advance
volatile is insufficient to prevent a race condition. You'll need a mutex, or you'll have to access the variables via primitives from stdatomic.h – Sackbut

Even on x86, you need an mfence (or some variant). So, once again, either a mutex or an atomic operation – Sackbut
You don't need mfence. You just need to stop compile-time reordering. The writer thread is write-only, so all you need is ordered writes. x86 asm does acquire/release semantics for free, and we don't need seq-cst here. (memory_order_release compiles without any extra barrier instructions on x86, just blocking compile-time reordering.) mfence doesn't make data visible to the other core any faster, it just stalls the current core's later loads until earlier stores are globally visible. Cache coherency doesn't take extra instructions, just ordering. – Herzog
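The point in this comment — plain stores for the payload, then a release store to publish the index — can be sketched in C11 like this (publish and try_read are hypothetical names; on x86 the release store compiles to an ordinary mov):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t payload;        /* stands in for the a_few_vars struct */
static _Atomic uint32_t seq;    /* stands in for `head` */

/* Writer: plain stores for the data, then a release store to publish.
   memory_order_release forbids the compiler (and, on weakly ordered
   CPUs, the hardware) from sinking the payload stores below the index
   store; on x86 it emits no barrier instruction at all. */
static void publish(uint64_t value, uint32_t new_seq)
{
    payload = value;
    atomic_store_explicit(&seq, new_seq, memory_order_release);
}

/* Reader: acquire load of the index, then plain loads of the data. */
static bool try_read(uint32_t last_seen, uint64_t *out)
{
    if (atomic_load_explicit(&seq, memory_order_acquire) == last_seen)
        return false;           /* nothing new published yet */
    *out = payload;
    return true;
}
```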
That's x86 arch knowledge. IMO, at a minimum, the code should include comments as to why, in this specific case, the more general solution isn't required, as this is relying on something that may eventually break on a different model or arch (e.g. arm) – Sackbut
The code should use atomic_store_explicit(..., memory_order_release). On x86, that compiles to just a mov store. On ARM, that will compile to a store + dsb ish. On AArch64, that will compile to a stlr (or whatever it's called), a sequential-consistency release store but no barrier. You don't want to mess around with targeting the asm memory model from C using _mm_mfence() manually or stuff like that, because the whole point of C11 stdatomic is to let you write portable code and have the compiler do that for you. It's nice to know what's efficient, though. – Herzog
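Putting the advice in this thread together, a rough, portable C11 sketch of the whole single-producer/single-consumer handoff might look like this (the ring size, item count, and all function names are assumptions for the demo, not from the question):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024u             /* power of two; demo assumption */
#define N_ITEMS   10000u

struct a_few_vars {
    uint32_t x1, x2;
    uint64_t x3, x4;
} __attribute__((aligned(64)));

static struct a_few_vars ring[RING_SIZE];
static _Atomic uint32_t head;       /* last item published by the writer */
static _Atomic uint32_t tail;       /* last item consumed by the reader */
static uint64_t consumed_sum;       /* touched only by the reader thread */

static void *writer(void *arg)
{
    (void)arg;
    for (uint32_t i = 1; i <= N_ITEMS; i++) {
        /* Flow control: never overwrite a slot the reader hasn't read. */
        while (atomic_load_explicit(&tail, memory_order_acquire) + RING_SIZE < i)
            ;                                       /* spin-wait */
        ring[i % RING_SIZE].x3 = i;                 /* plain payload store */
        atomic_store_explicit(&head, i, memory_order_release);  /* publish */
    }
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    for (uint32_t i = 1; i <= N_ITEMS; i++) {
        while (atomic_load_explicit(&head, memory_order_acquire) < i)
            ;                                       /* spin-wait */
        consumed_sum += ring[i % RING_SIZE].x3;     /* plain payload load */
        atomic_store_explicit(&tail, i, memory_order_release);
    }
    return NULL;
}

/* Runs one writer and one reader and returns the sum the reader saw. */
static uint64_t run_demo(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return consumed_sum;            /* 1 + 2 + ... + N_ITEMS */
}
```

This is a sketch, not a tuned implementation: real code would pad head and tail onto separate cache lines to avoid false sharing, and would back off or yield instead of spinning hard.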