False sharing prevention with alignas is broken

I'm not used to posting questions on the internet, so please tell me if I'm doing something wrong.

In short

  1. How do I correctly prevent false sharing on a 64-bit architecture where the CPU cache-line size is 64 bytes?

  2. How can the use of the C++ 'alignas' keyword versus a simple byte-array pad (e.g. char[64]) affect multithreading efficiency?

Context

While working on a highly efficient implementation of a single-producer single-consumer queue, I encountered behaviour from the GCC compiler that I couldn't explain while benchmarking my code.

Full story

I hope somebody will have the necessary knowledge to explain what is going on.

I'm currently using GCC 10.2.0 and its C++20 implementation on Arch Linux. My laptop is a Lenovo T470s with an i7-7500U processor.

Let me begin with the data structure:

class SPSCQueue
{
public:
    ...

private:
    alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
    char _pad0[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding

    alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding
};

The data structure above achieves a fast and stable 20 ns per push/pop on my system.

However, merely changing the alignment with the following members makes the benchmark unstable, giving between 20 and 30 ns.

    alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct alignas(64) {
        Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
    };

    alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct alignas(64) {
        Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer1
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
    };
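One way to check that these variants really produce the layout I expect is to assert the sizes and offsets at compile time. A sketch, using named nested structs instead of the anonymous ones so it compiles standalone, and a hypothetical 16-byte Buffer stand-in:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for the real Buffer (Type * + size_t = 16 bytes on x86-64).
struct Buffer { int *data = nullptr; std::size_t capacity = 0; };

struct Layout {
    alignas(64) std::atomic<std::size_t> tail { 0 };
    struct alignas(64) ProducerSide {
        Buffer buffer {};
        std::size_t headCache { 0 };
    } producer;
    alignas(64) std::atomic<std::size_t> head { 0 };
    struct alignas(64) ConsumerSide {
        Buffer buffer2 {};
        std::size_t tailCache { 0 };
    } consumer;
};

// Every alignas(64) member starts a fresh cache line, and each alignas(64)
// struct is padded to a multiple of 64 bytes, so the object spans exactly
// four cache lines.
static_assert(sizeof(Layout) == 256, "four 64-byte cache lines");
static_assert(alignof(Layout) == 64, "object starts on a cache-line boundary");
static_assert(offsetof(Layout, head) == 128, "head starts the third line");
```

The static_asserts confirm that all three variants shown here are supposed to be byte-for-byte identical in size, which is why the timing differences were so surprising.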

Lastly, I got even more lost when I tried the following configuration, which gave me results between 40 and 55 ns.


    std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    char _pad0[64 - sizeof(std::atomic<size_t>)];
    Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)];

    std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    char _pad2[64 - sizeof(std::atomic<size_t>)];
    Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
    char _pad3[64 - sizeof(Buffer) - sizeof(std::size_t)];

This time the queue's push/pop oscillates between 40 and 55 ns.

I'm very lost at this point because I don't know where I should look for answers. Until now, C++ memory layout has been very intuitive to me, but I've realized I'm still missing important knowledge needed for high-frequency multithreading.

Minimal code sample

If you wish to compile the whole code to test it yourself, here are the few files needed:

SPSCQueue.hpp:


#pragma once

#include <atomic>
#include <cstdlib>
#include <cinttypes>
#include <stdexcept>
#include <type_traits>
#include <utility>

#define KF_ALIGN_CACHELINE alignas(kF::Core::Utils::CacheLineSize)

namespace kF::Core
{
    template<typename Type>
    class SPSCQueue;

    namespace Utils
    {
        /** @brief Helper used to perfect forward move / copy constructor */
        template<typename Type, bool ForceCopy = false>
        void ForwardConstruct(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_constructible_v<Type>)
                new (dest) Type(std::move(*source));
            else
                new (dest) Type(*source);
        }

        /** @brief Helper used to perfect forward move / copy assignment */
        template<typename Type, bool ForceCopy = false>
        void ForwardAssign(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
                *dest = std::move(*source);
            else
                *dest = *source;
        }

        /** @brief Theoretical cacheline size */
        constexpr std::size_t CacheLineSize = 64ul;
    }
}

/**
 * @brief The SPSC queue is a lock-free queue that only supports a Single Producer and a Single Consumer
 * The queue is really fast compared to other, more flexible implementations because the fact that only two threads can simultaneously read / write
 * means that less synchronization is needed for each operation.
 * The queue supports ranged push / pop to insert multiple elements without performance impact
 *
 * @tparam Type to be inserted
 */
template<typename Type>
class kF::Core::SPSCQueue
{
public:
    /** @brief Buffer structure containing all cells */
    struct Buffer
    {
        Type *data { nullptr };
        std::size_t capacity { 0 };
    };

    /** @brief Local thread cache */
    struct Cache
    {
        Buffer buffer {};
        std::size_t value { 0 };
    };

    /** @brief Default constructor initialize the queue */
    SPSCQueue(const std::size_t capacity);

    /** @brief Destruct and release all memory (unsafe) */
    ~SPSCQueue(void) { clear(); std::free(_buffer.data); }

    /** @brief Push a single element into the queue
     *  @return true if the element has been inserted */
    template<typename ...Args>
    [[nodiscard]] inline bool push(Args &&...args);

    /** @brief Pop a single element from the queue
     *  @return true if an element has been extracted */
    [[nodiscard]] inline bool pop(Type &value);

    /** @brief Clear all elements of the queue (unsafe) */
    void clear(void);

private:
    KF_ALIGN_CACHELINE std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct {
        Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
        char _pad0[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    KF_ALIGN_CACHELINE std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct {
        Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
        char _pad1[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    /** @brief Copy and move constructors disabled */
    SPSCQueue(const SPSCQueue &other) = delete;
    SPSCQueue(SPSCQueue &&other) = delete;
};

static_assert(sizeof(kF::Core::SPSCQueue<int>) == 4 * kF::Core::Utils::CacheLineSize);

template<typename Type>
kF::Core::SPSCQueue<Type>::SPSCQueue(const std::size_t capacity)
{
    _buffer.capacity = capacity;
    if (_buffer.data = reinterpret_cast<Type *>(std::malloc(sizeof(Type) * capacity)); !_buffer.data)
        throw std::runtime_error("Core::SPSCQueue: Malloc failed");
    _buffer2 = _buffer;
}

template<typename Type>
template<typename ...Args>
bool kF::Core::SPSCQueue<Type>::push(Args &&...args)
{
    static_assert(std::is_constructible<Type, Args...>::value, "Type must be constructible from Args...");

    const auto tail = _tail.load(std::memory_order_relaxed);
    auto next = tail + 1;

    if (next == _buffer.capacity) [[unlikely]]
        next = 0;
    if (auto head = _headCache; next == head) [[unlikely]] {
        head = _headCache = _head.load(std::memory_order_acquire);
        if (next == head) [[unlikely]]
            return false;
    }
    new (_buffer.data + tail) Type{ std::forward<Args>(args)... };
    _tail.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
bool kF::Core::SPSCQueue<Type>::pop(Type &value)
{
    const auto head = _head.load(std::memory_order_relaxed);

    if (auto tail = _tailCache; head == tail) [[unlikely]] {
        tail = _tailCache = _tail.load(std::memory_order_acquire);
        if (head == tail) [[unlikely]]
            return false;
    }
    auto *elem = _buffer2.data + head; // data is already a Type *, no cast needed
    auto next = head + 1;
    if (next == _buffer2.capacity) [[unlikely]]
        next = 0;
    value = std::move(*elem);
    elem->~Type();
    _head.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
void kF::Core::SPSCQueue<Type>::clear(void)
{
    for (Type type; pop(type););
}
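The point of the _headCache / _tailCache members above is that each thread reloads the other thread's atomic only when the queue looks full (or empty) from its cached view, keeping the hot path free of cross-core traffic. A stripped-down, self-contained sketch of that same protocol (hypothetical names, plain int payload, fixed-size array instead of the dynamic buffer):

```cpp
#include <atomic>
#include <cstddef>

template<std::size_t Capacity>
struct MiniSPSC
{
    int slots[Capacity] {};
    std::atomic<std::size_t> tail { 0 };  // written by producer, read by consumer
    std::atomic<std::size_t> head { 0 };  // written by consumer, read by producer
    std::size_t headCache { 0 };          // producer-local copy of head
    std::size_t tailCache { 0 };          // consumer-local copy of tail

    bool push(int value)
    {
        const auto t = tail.load(std::memory_order_relaxed);
        const auto next = (t + 1 == Capacity) ? 0 : t + 1;
        if (next == headCache) { // looks full: refresh the cached head
            headCache = head.load(std::memory_order_acquire);
            if (next == headCache)
                return false;
        }
        slots[t] = value;
        tail.store(next, std::memory_order_release);
        return true;
    }

    bool pop(int &value)
    {
        const auto h = head.load(std::memory_order_relaxed);
        if (h == tailCache) { // looks empty: refresh the cached tail
            tailCache = tail.load(std::memory_order_acquire);
            if (h == tailCache)
                return false;
        }
        value = slots[h];
        head.store((h + 1 == Capacity) ? 0 : h + 1, std::memory_order_release);
        return true;
    }
};
```

One slot is deliberately left unused so that head == tail unambiguously means empty, matching the capacity handling in the full queue above.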

The benchmark uses Google Benchmark. bench_SPSCQueue.cpp:

#include <chrono>
#include <thread>

#include <benchmark/benchmark.h>

#include "SPSCQueue.hpp"

using namespace kF;

using Queue = Core::SPSCQueue<std::size_t>;

constexpr std::size_t Capacity = 4096;

static void SPSCQueue_NoisyPush(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] { for (std::size_t tmp; running; benchmark::DoNotOptimize(queue.pop(tmp))); });
    for (auto _ : state) {
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.push(42ul));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        auto iterationTime = elapsed.count();
        state.SetIterationTime(iterationTime);
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPush)->UseManualTime();

static void SPSCQueue_NoisyPop(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] { while (running) benchmark::DoNotOptimize(queue.push(42ul)); });
    for (auto _ : state) {
        std::size_t tmp;
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.pop(tmp));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        auto iterationTime = elapsed.count();
        state.SetIterationTime(iterationTime);
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPop)->UseManualTime();

Extreme answered 2/9, 2020 at 13:31 Comment(16)
Have you checked godbolt to see the ASM the compiler outputs? – Hagberry
@SombreroChicken No I didn't; I'm not very comfortable with ASM, but I'll definitely give it a try, thank you for the advice. – Extreme
Try playing with your code here and see what the ASM does when you swap your declarations around. – Hagberry
I'll throw in a bounty if this sees no activity. – Hagberry
@SombreroChicken After digging into the ASM, for now I only know that in some cases GCC will use either the 'MOVUPS' (unaligned) or 'MOVAPS' (aligned) instruction. It seems that an unaligned instruction can introduce extra overhead, but it doesn't explain why the benchmark varies that much between usages. – Extreme
Are you sure struct alignas(64) works? I thought normally you'd put the alignas before the struct keyword. I'd guess that your anonymous struct means something different for alignment than you thought. – Buffy
@PeterCordes When used on a structure declaration, the 'alignas' keyword must be on the right side of the 'struct' or 'class' keyword. The anonymous struct doesn't affect performance; I use it because it's the only way to align two variables at once. Each example shown is supposed to be equivalent at the C++ logical level, but it seems that's not the case at the compiler level. – Extreme
Align which two variables at once? You already have alignas(64) on the first variable after the struct. An aligned struct means the first member will be aligned, and the size of the struct will be a multiple of the alignment, so the first variable after the struct will be aligned. But the 2nd struct member gets no such guarantee. (Your non-struct way doesn't try to separate that from the buffer, so I don't think you're trying to align it.) – Buffy
@PeterCordes What do you mean by "the first member will be aligned, and the size of the struct will be a multiple of the alignment"? In my mind, the only thing alignas does is add padding to a variable / structure to fill a specific boundary size (and all the tests I've done show that the structure size is equal for each benchmark). Did I misunderstand anything here? – Extreme
@MatthieuMoinvaziri You can use offsetof to get more info about structure memory layout. I've played a little with yours, even though I don't know what to look for. See the added static_asserts under SPSCQueue in godbolt.org/z/M7Mdo7 My understanding of having sizeof(Queue) == 4 * kF::Core::Utils::CacheLineSize is to have the memory in 4 cache lines: atomic0, data0, atomic1, data1. It does not seem to be the case, based on the offsetofs. – Mudskipper
@Mudskipper In the static_asserts you added, you checked whether 'data0' is in the same cache line as 'atomic0'. However, it isn't the case: as you said, I've got 4 blocks on 4 different cache lines. Anyway, I didn't think to use offsetof to verify each member, so thank you, but it seems not to be the root of the issue. – Extreme
After digging for a while, and thanks to your comments, I realized the root of the issue was the overall Queue structure alignment. In fact, because the consumer and producer threads each use 128 bytes of memory, the SPSCQueue structure needed to be aligned to 128 bytes, not 64. When I align the structure to 128, each configuration gives the same benchmark results. If somebody has exact knowledge of why aligning the Queue to 128 bytes rather than 64 would matter when the cache-line size is 64 bytes, I would really like to know. – Extreme
Intel's L2 spatial prefetcher tries to complete a 128-byte aligned pair of cache lines after loading one or the other of the pair. – Buffy
Write a self answer. – Hagberry
When you apply alignas(64) to members, char _pad is unnecessary; the compiler inserts padding for you. – Louannlouanna
@MaximEgorushkin Yes, you are right; that's why I was very confused when I saw that it would give different results, but it was only due to structure alignment, not member alignment. – Extreme

Thanks to your useful comments (and mostly, thanks to Peter Cordes), it seems that the issue was coming from the L2 data prefetcher.

Because of my SPSC queue design, each thread must access two consecutive cache lines to push to / pop from the queue. If the structure itself is not aligned to 128 bytes, the two cache lines a thread touches can straddle a 128-byte aligned pair, and the L2 spatial prefetcher, which completes such pairs, then pulls each thread's lines into the other core's cache, reintroducing false sharing.

Thus, the simple fix is:

template<typename Type>
class alignas(128) SPSCQueue { ... };
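It's also worth checking that heap allocations respect the extended alignment; since C++17, operator new honours over-aligned types, so a new-ed queue gets the 128-byte alignment too. A quick sketch, with a hypothetical Padded stand-in for the queue class:

```cpp
#include <cstdint>
#include <new>

// Hypothetical stand-in for the 128-byte-aligned queue class.
struct alignas(128) Padded { char data[256]; };

// Since C++17, operator new honours extended ("over-aligned") alignment
// requirements, so a heap-allocated instance is 128-byte aligned as well.
inline bool heapAligned()
{
    auto *p = new Padded;
    const bool ok = reinterpret_cast<std::uintptr_t>(p) % 128 == 0;
    delete p;
    return ok;
}
```

Before C++17 (or with a custom allocator such as the malloc used in the queue's constructor), the alignment would have to be enforced manually, e.g. with std::aligned_alloc.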

Section 2.5.5.4 ("Data Prefetching") of Intel's Optimization Reference Manual is an interesting read from Intel explaining optimization on their architectures and how prefetching is done at the various levels of cache.
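As a side note, a more portable way to express the cache-line constant is C++17's std::hardware_destructive_interference_size from <new>. A sketch (GCC only ships the constant from version 12 onwards, hence the feature-test guard, and doubling it to defeat the adjacent-line prefetcher is my own choice here, not something the standard prescribes):

```cpp
#include <cstddef>
#include <new>

// Portable cache-line size: fall back to 64 where the standard library
// does not provide std::hardware_destructive_interference_size.
#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t CacheLineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t CacheLineSize = 64;
#endif

// Separate producer and consumer data by two cache lines so the
// L2 adjacent-line prefetcher cannot couple them.
constexpr std::size_t FalseSharingRange = 2 * CacheLineSize;

static_assert(FalseSharingRange % CacheLineSize == 0);
```

On x86-64 this yields 128 bytes either way, matching the alignas(128) fix above.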

Extreme answered 23/9, 2020 at 13:39 Comment(0)
