I'm not used to post any question on the internet so please tell me if I'm doing something wrong.
In short
How to correctly prevent false sharing on a 64bits architecture with a CPU cacheline size of 64bytes ?
How does the usage of C++ 'alignas' keyword and a simple byte array (ex: char[64]) can affect multithreading efficiency ?
Context
While working on a very efficient implementation of a Single Consumer Single Producer Queue, I've encountered unlogical behaviours from GCC compiler while benchmarking my code.
Full story
I hope somebody will have the necessary knowledge to explain what is going on.
I'm currently using GCC 10.2.0 and its C++ 20 implementation on arch linux. My laptop is a Lenovo T470S having a i7-7500U processor.
Let me begin with the data structure:
class SPSCQueue
{
public:
...
private:
alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
std::size_t _headCache { 0 }; // Head cache for the producer
char _pad0[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding
alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer2
std::size_t _tailCache { 0 }; // Head cache for the consumer
char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding
};
The following data structure obtains a fast and stable 20ns at pushing / poping on my system.
However, only changing the alignment using the following members makes the benchmark unstable and give between 20 and 30ns.
alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
struct alignas(64) {
Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
std::size_t _headCache { 0 }; // Head cache for the producer
};
alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
struct alignas(64) {
Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer1
std::size_t _tailCache { 0 }; // Tail cache for the consumer
};
Lastly, I got even more lost when I tried this configuration giving me results between 40 and 55ns.
std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
char _pad0[64 - sizeof(std::atomic<size_t>)];
Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
std::size_t _headCache { 0 }; // Head cache for the producer
char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)];
std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
char _pad2[64 - sizeof(std::atomic<size_t>)];
Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer2
std::size_t _tailCache { 0 }; // Head cache for the consumer
char _pad3[64 - sizeof(Buffer) - sizeof(std::size_t)];
This time I got the queue push/pop oscillating both between 40 and 55ns.
I'm very lost at this point because I don't know where should I look for answers. Until now the C++ memory layout has been very intuitive for me but I realized that I still miss very important knowledges to be better at high frequency multithreading.
Minimal code sample
If you wish to compile the whole code to test it yourself, here are the few files needed:
SPSCQueue.hpp:
#pragma once
#include <atomic>
#include <cstdlib>
#include <cinttypes>
#define KF_ALIGN_CACHELINE alignas(kF::Core::Utils::CacheLineSize)
namespace kF::Core
{
template<typename Type>
class SPSCQueue;
namespace Utils
{
/** @brief Helper used to perfect forward move / copy constructor */
template<typename Type, bool ForceCopy = false>
void ForwardConstruct(Type *dest, Type *source) {
if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
new (dest) Type(std::move(*source));
else
new (dest) Type(*source);
}
/** @brief Helper used to perfect forward move / copy assignment */
template<typename Type, bool ForceCopy = false>
void ForwardAssign(Type *dest, Type *source) {
if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
*dest = std::move(*source);
else
*dest = *source;
}
/** @brief Theorical cacheline size */
constexpr std::size_t CacheLineSize = 64ul;
}
}
/**
* @brief The SPSC queue is a lock-free queue that only supports a Single Producer and a Single Consumer
* The queue is really fast compared to other more flexible implementations because the fact that only two thread can simultaneously read / write
* means that less synchronization is needed for each operation.
* The queue supports ranged push / pop to insert multiple elements without performance impact
*
* @tparam Type to be inserted
*/
template<typename Type>
class kF::Core::SPSCQueue
{
public:
/** @brief Buffer structure containing all cells */
struct Buffer
{
Type *data { nullptr };
std::size_t capacity { 0 };
};
/** @brief Local thread cache */
struct Cache
{
Buffer buffer {};
std::size_t value { 0 };
};
/** @brief Default constructor initialize the queue */
SPSCQueue(const std::size_t capacity);
/** @brief Destruct and release all memory (unsafe) */
~SPSCQueue(void) { clear(); std::free(_buffer.data); }
/** @brief Push a single element into the queue
* @return true if the element has been inserted */
template<typename ...Args>
[[nodiscard]] inline bool push(Args &&...args);
/** @brief Pop a single element from the queue
* @return true if an element has been extracted */
[[nodiscard]] inline bool pop(Type &value);
/** @brief Clear all elements of the queue (unsafe) */
void clear(void);
private:
KF_ALIGN_CACHELINE std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
struct {
Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
std::size_t _headCache { 0 }; // Head cache for the producer
char _pad0[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
};
KF_ALIGN_CACHELINE std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
struct{
Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer2
std::size_t _tailCache { 0 }; // Head cache for the consumer
char _pad1[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
};
/** @brief Copy and move constructors disabled */
SPSCQueue(const SPSCQueue &other) = delete;
SPSCQueue(SPSCQueue &&other) = delete;
};
static_assert(sizeof(kF::Core::SPSCQueue<int>) == 4 * kF::Core::Utils::CacheLineSize);
template<typename Type>
kF::Core::SPSCQueue<Type>::SPSCQueue(const std::size_t capacity)
{
_buffer.capacity = capacity;
if (_buffer.data = reinterpret_cast<Type *>(std::malloc(sizeof(Type) * capacity)); !_buffer.data)
throw std::runtime_error("Core::SPSCQueue: Malloc failed");
_buffer2 = _buffer;
}
template<typename Type>
template<typename ...Args>
bool kF::Core::SPSCQueue<Type>::push(Args &&...args)
{
static_assert(std::is_constructible<Type, Args...>::value, "Type must be constructible from Args...");
const auto tail = _tail.load(std::memory_order_relaxed);
auto next = tail + 1;
if (next == _buffer.capacity) [[unlikely]]
next = 0;
if (auto head = _headCache; next == head) [[unlikely]] {
head = _headCache = _head.load(std::memory_order_acquire);
if (next == head) [[unlikely]]
return false;
}
new (_buffer.data + tail) Type{ std::forward<Args>(args)... };
_tail.store(next, std::memory_order_release);
return true;
}
template<typename Type>
bool kF::Core::SPSCQueue<Type>::pop(Type &value)
{
const auto head = _head.load(std::memory_order_relaxed);
if (auto tail = _tailCache; head == tail) [[unlikely]] {
tail = _tailCache = _tail.load(std::memory_order_acquire);
if (head == tail) [[unlikely]]
return false;
}
auto *elem = reinterpret_cast<Type *>(_buffer2.data + head);
auto next = head + 1;
if (next == _buffer2.capacity) [[unlikely]]
next = 0;
value = std::move(*elem);
elem->~Type();
_head.store(next, std::memory_order_release);
return true;
}
template<typename Type>
void kF::Core::SPSCQueue<Type>::clear(void)
{
for (Type type; pop(type););
}
The benchmark, using google benchmark. bench_SPSCQueue.cpp:
#include <thread>
#include <benchmark/benchmark.h>
#include "SPSCQueue.hpp"
using namespace kF;
using Queue = Core::SPSCQueue<std::size_t>;
constexpr std::size_t Capacity = 4096;
static void SPSCQueue_NoisyPush(benchmark::State &state)
{
Queue queue(Capacity);
std::atomic<bool> running = true;
std::size_t i = 0ul;
std::thread thd([&queue, &running] { for (std::size_t tmp; running; benchmark::DoNotOptimize(queue.pop(tmp))); });
for (auto _ : state) {
decltype(std::chrono::high_resolution_clock::now()) start;
do {
start = std::chrono::high_resolution_clock::now();
} while (!queue.push(42ul));
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
auto iterationTime = elapsed.count();
state.SetIterationTime(iterationTime);
}
running = false;
if (thd.joinable())
thd.join();
}
BENCHMARK(SPSCQueue_NoisyPush)->UseManualTime();
static void SPSCQueue_NoisyPop(benchmark::State &state)
{
Queue queue(Capacity);
std::atomic<bool> running = true;
std::size_t i = 0ul;
std::thread thd([&queue, &running] { while (running) benchmark::DoNotOptimize(queue.push(42ul)); });
for (auto _ : state) {
std::size_t tmp;
decltype(std::chrono::high_resolution_clock::now()) start;
do {
start = std::chrono::high_resolution_clock::now();
} while (!queue.pop(tmp));
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
auto iterationTime = elapsed.count();
state.SetIterationTime(iterationTime);
}
running = false;
if (thd.joinable())
thd.join();
}
BENCHMARK(SPSCQueue_NoisyPop)->UseManualTime();
struct alignas(64)
works? I thought normally you'd put the alignas before thestruct
keyword. I'd guess that your anonymous struct means something different for alignment than you thought. – Buffyalignas(64)
on the first variable after the struct. An aligned struct means the first member will be aligned, and the size of the struct will be a multiple of the alignment so the first variable after the struct will be aligned. But the 2nd struct member gets no such guarantee. (Your non-struct way doesn't try to separate that from the buffer so I don't think you're trying to align it.) – Buffyoffsetof
to get more info about structure memory layout. I've played a little with yours, even though I don't know what to look for. See addedstatic_assert
s underSPSCQueue
in godbolt.org/z/M7Mdo7 My understanding for havingsizeof(Queue) == 4 * kF::Core::Utils::CacheLineSize
is to have memory in 4 caches: atomic0, data0, atomic1, data1. It does not seem to be the case, based onoffsetof
s – Mudskipperstatic_assert
s your added, you checked that 'data0' is in the same cacheline of 'atomic0'. However, it isn't the case: as you said, I've got 4 caches on 4 different cachelines. Anyway, I didn't thinked to used offsetof to verify each members so thank you, but it seems not to be the root of the issue. – Extremealignas(64)
to members,char _pad
is unnecessary, the compiler inserts padding for you. – Louannlouanna