False sharing prevention with alignas is broken

I'm not used to posting questions on the internet, so please tell me if I'm doing something wrong.

In short

  1. How do I correctly prevent false sharing on a 64-bit architecture where the CPU cache-line size is 64 bytes?

  2. How can the use of the C++ 'alignas' keyword versus a simple byte-array pad (e.g. char[64]) affect multithreading efficiency?

Context

While working on a highly efficient implementation of a single-producer single-consumer queue, I encountered behaviour from the GCC compiler that I couldn't explain while benchmarking my code.

Full story

I hope somebody will have the necessary knowledge to explain what is going on.

I'm currently using GCC 10.2.0 and its C++20 implementation on Arch Linux. My laptop is a Lenovo T470s with an i7-7500U processor.

Let me begin with the data structure:

class SPSCQueue
{
public:
    ...

private:
    alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
    char _pad0[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding

    alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding
};

The data structure above achieves a fast and stable 20 ns per push/pop on my system.

However, merely changing the alignment with the following members makes the benchmark unstable, giving between 20 and 30 ns.

    alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct alignas(64) {
        Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
    };

    alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct alignas(64) {
        Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer1
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
    };
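One way to check that these variants really produce the layout I expect is to assert the sizes and offsets at compile time. A sketch, using named nested structs instead of the anonymous ones so it compiles standalone, and a hypothetical 16-byte Buffer stand-in:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for the real Buffer (Type * + size_t = 16 bytes on x86-64).
struct Buffer { int *data = nullptr; std::size_t capacity = 0; };

struct Layout {
    alignas(64) std::atomic<std::size_t> tail { 0 };
    struct alignas(64) ProducerSide {
        Buffer buffer {};
        std::size_t headCache { 0 };
    } producer;
    alignas(64) std::atomic<std::size_t> head { 0 };
    struct alignas(64) ConsumerSide {
        Buffer buffer2 {};
        std::size_t tailCache { 0 };
    } consumer;
};

// Every alignas(64) member starts a fresh cache line, and each alignas(64)
// struct is padded to a multiple of 64 bytes, so the object spans exactly
// four cache lines.
static_assert(sizeof(Layout) == 256, "four 64-byte cache lines");
static_assert(alignof(Layout) == 64, "object starts on a cache-line boundary");
static_assert(offsetof(Layout, head) == 128, "head starts the third line");
```

The static_asserts confirm that all three variants shown here are supposed to be byte-for-byte identical in size, which is why the timing differences were so surprising.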

Lastly, I got even more lost when I tried the following configuration, which gave me results between 40 and 55 ns.


    std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    char _pad0[64 - sizeof(std::atomic<size_t>)];
    Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)];

    std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    char _pad2[64 - sizeof(std::atomic<size_t>)];
    Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
    char _pad3[64 - sizeof(Buffer) - sizeof(std::size_t)];

This time the queue's push/pop oscillates between 40 and 55 ns.

I'm very lost at this point because I don't know where I should look for answers. Until now, C++ memory layout has been very intuitive to me, but I've realized I'm still missing important knowledge needed for high-frequency multithreading.

Minimal code sample

If you wish to compile the whole code to test it yourself, here are the few files needed:

SPSCQueue.hpp:


#pragma once

#include <atomic>
#include <cstdlib>
#include <cinttypes>
#include <stdexcept>
#include <type_traits>
#include <utility>

#define KF_ALIGN_CACHELINE alignas(kF::Core::Utils::CacheLineSize)

namespace kF::Core
{
    template<typename Type>
    class SPSCQueue;

    namespace Utils
    {
        /** @brief Helper used to perfect forward move / copy constructor */
        template<typename Type, bool ForceCopy = false>
        void ForwardConstruct(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_constructible_v<Type>)
                new (dest) Type(std::move(*source));
            else
                new (dest) Type(*source);
        }

        /** @brief Helper used to perfect forward move / copy assignment */
        template<typename Type, bool ForceCopy = false>
        void ForwardAssign(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
                *dest = std::move(*source);
            else
                *dest = *source;
        }

        /** @brief Theoretical cacheline size */
        constexpr std::size_t CacheLineSize = 64ul;
    }
}

/**
 * @brief The SPSC queue is a lock-free queue that only supports a Single Producer and a Single Consumer
 * The queue is really fast compared to other, more flexible implementations because the fact that only two threads can simultaneously read / write
 * means that less synchronization is needed for each operation.
 * The queue supports ranged push / pop to insert multiple elements without performance impact
 *
 * @tparam Type to be inserted
 */
template<typename Type>
class kF::Core::SPSCQueue
{
public:
    /** @brief Buffer structure containing all cells */
    struct Buffer
    {
        Type *data { nullptr };
        std::size_t capacity { 0 };
    };

    /** @brief Local thread cache */
    struct Cache
    {
        Buffer buffer {};
        std::size_t value { 0 };
    };

    /** @brief Default constructor initialize the queue */
    SPSCQueue(const std::size_t capacity);

    /** @brief Destruct and release all memory (unsafe) */
    ~SPSCQueue(void) { clear(); std::free(_buffer.data); }

    /** @brief Push a single element into the queue
     *  @return true if the element has been inserted */
    template<typename ...Args>
    [[nodiscard]] inline bool push(Args &&...args);

    /** @brief Pop a single element from the queue
     *  @return true if an element has been extracted */
    [[nodiscard]] inline bool pop(Type &value);

    /** @brief Clear all elements of the queue (unsafe) */
    void clear(void);

private:
    KF_ALIGN_CACHELINE std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct {
        Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
        char _pad0[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    KF_ALIGN_CACHELINE std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct {
        Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
        char _pad1[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    /** @brief Copy and move constructors disabled */
    SPSCQueue(const SPSCQueue &other) = delete;
    SPSCQueue(SPSCQueue &&other) = delete;
};

static_assert(sizeof(kF::Core::SPSCQueue<int>) == 4 * kF::Core::Utils::CacheLineSize);

template<typename Type>
kF::Core::SPSCQueue<Type>::SPSCQueue(const std::size_t capacity)
{
    _buffer.capacity = capacity;
    if (_buffer.data = reinterpret_cast<Type *>(std::malloc(sizeof(Type) * capacity)); !_buffer.data)
        throw std::runtime_error("Core::SPSCQueue: Malloc failed");
    _buffer2 = _buffer;
}

template<typename Type>
template<typename ...Args>
bool kF::Core::SPSCQueue<Type>::push(Args &&...args)
{
    static_assert(std::is_constructible<Type, Args...>::value, "Type must be constructible from Args...");

    const auto tail = _tail.load(std::memory_order_relaxed);
    auto next = tail + 1;

    if (next == _buffer.capacity) [[unlikely]]
        next = 0;
    if (auto head = _headCache; next == head) [[unlikely]] {
        head = _headCache = _head.load(std::memory_order_acquire);
        if (next == head) [[unlikely]]
            return false;
    }
    new (_buffer.data + tail) Type{ std::forward<Args>(args)... };
    _tail.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
bool kF::Core::SPSCQueue<Type>::pop(Type &value)
{
    const auto head = _head.load(std::memory_order_relaxed);

    if (auto tail = _tailCache; head == tail) [[unlikely]] {
        tail = _tailCache = _tail.load(std::memory_order_acquire);
        if (head == tail) [[unlikely]]
            return false;
    }
    auto *elem = _buffer2.data + head; // data is already a Type *, no cast needed
    auto next = head + 1;
    if (next == _buffer2.capacity) [[unlikely]]
        next = 0;
    value = std::move(*elem);
    elem->~Type();
    _head.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
void kF::Core::SPSCQueue<Type>::clear(void)
{
    for (Type type; pop(type););
}
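The point of the _headCache / _tailCache members above is that each thread reloads the other thread's atomic only when the queue looks full (or empty) from its cached view, keeping the hot path free of cross-core traffic. A stripped-down, self-contained sketch of that same protocol (hypothetical names, plain int payload, fixed-size array instead of the dynamic buffer):

```cpp
#include <atomic>
#include <cstddef>

template<std::size_t Capacity>
struct MiniSPSC
{
    int slots[Capacity] {};
    std::atomic<std::size_t> tail { 0 };  // written by producer, read by consumer
    std::atomic<std::size_t> head { 0 };  // written by consumer, read by producer
    std::size_t headCache { 0 };          // producer-local copy of head
    std::size_t tailCache { 0 };          // consumer-local copy of tail

    bool push(int value)
    {
        const auto t = tail.load(std::memory_order_relaxed);
        const auto next = (t + 1 == Capacity) ? 0 : t + 1;
        if (next == headCache) { // looks full: refresh the cached head
            headCache = head.load(std::memory_order_acquire);
            if (next == headCache)
                return false;
        }
        slots[t] = value;
        tail.store(next, std::memory_order_release);
        return true;
    }

    bool pop(int &value)
    {
        const auto h = head.load(std::memory_order_relaxed);
        if (h == tailCache) { // looks empty: refresh the cached tail
            tailCache = tail.load(std::memory_order_acquire);
            if (h == tailCache)
                return false;
        }
        value = slots[h];
        head.store((h + 1 == Capacity) ? 0 : h + 1, std::memory_order_release);
        return true;
    }
};
```

One slot is deliberately left unused so that head == tail unambiguously means empty, matching the capacity handling in the full queue above.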

The benchmark uses Google Benchmark. bench_SPSCQueue.cpp:

#include <chrono>
#include <thread>

#include <benchmark/benchmark.h>

#include "SPSCQueue.hpp"

using namespace kF;

using Queue = Core::SPSCQueue<std::size_t>;

constexpr std::size_t Capacity = 4096;

static void SPSCQueue_NoisyPush(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] { for (std::size_t tmp; running; benchmark::DoNotOptimize(queue.pop(tmp))); });
    for (auto _ : state) {
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.push(42ul));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        auto iterationTime = elapsed.count();
        state.SetIterationTime(iterationTime);
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPush)->UseManualTime();

static void SPSCQueue_NoisyPop(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] { while (running) benchmark::DoNotOptimize(queue.push(42ul)); });
    for (auto _ : state) {
        std::size_t tmp;
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.pop(tmp));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        auto iterationTime = elapsed.count();
        state.SetIterationTime(iterationTime);
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPop)->UseManualTime();

Extreme answered 2/9, 2020 at 13:31 Comment(16)
Have you checked godbolt to see the ASM the compiler outputs? – Hagberry
@SombreroChicken No I didn't; I'm not very comfortable with ASM, but I'll definitely give it a try, thank you for the advice. – Extreme
Try playing with your code here and see what the ASM does when you swap your declarations around. – Hagberry
I'll throw in a bounty if this sees no activity. – Hagberry
@SombreroChicken After digging into the ASM, for now I only know that in some cases GCC will use either the 'MOVUPS' (unaligned) or 'MOVAPS' (aligned) instruction. It seems that an unaligned instruction can introduce extra overhead, but it doesn't explain why the benchmark varies that much between usages. – Extreme
Are you sure struct alignas(64) works? I thought normally you'd put the alignas before the struct keyword. I'd guess that your anonymous struct means something different for alignment than you thought. – Buffy
@PeterCordes When used on a structure declaration, the 'alignas' keyword must be on the right side of the 'struct' or 'class' keyword. The anonymous struct doesn't affect performance; I use it because it's the only way to align two variables at once. Each example shown is supposed to be equivalent at the C++ logical level, but it seems that's not the case at the compiler level. – Extreme
Align which two variables at once? You already have alignas(64) on the first variable after the struct. An aligned struct means the first member will be aligned, and the size of the struct will be a multiple of the alignment, so the first variable after the struct will be aligned. But the 2nd struct member gets no such guarantee. (Your non-struct way doesn't try to separate that from the buffer, so I don't think you're trying to align it.) – Buffy
@PeterCordes What do you mean by "the first member will be aligned, and the size of the struct will be a multiple of the alignment"? In my mind, the only thing alignas does is add padding to a variable / structure to fill a specific boundary size (and all the tests I've done show that the structure size is equal for each benchmark). Did I misunderstand anything here? – Extreme
@MatthieuMoinvaziri You can use offsetof to get more info about structure memory layout. I've played a little with yours, even though I don't know what to look for. See the added static_asserts under SPSCQueue in godbolt.org/z/M7Mdo7 My understanding of having sizeof(Queue) == 4 * kF::Core::Utils::CacheLineSize is to have the memory in 4 cache lines: atomic0, data0, atomic1, data1. It does not seem to be the case, based on the offsetofs. – Mudskipper
@Mudskipper In the static_asserts you added, you checked whether 'data0' is in the same cache line as 'atomic0'. However, it isn't the case: as you said, I've got 4 blocks on 4 different cache lines. Anyway, I didn't think to use offsetof to verify each member, so thank you, but it seems not to be the root of the issue. – Extreme
After digging for a while, and thanks to your comments, I realized the root of the issue was the overall Queue structure alignment. In fact, because the consumer and producer threads each use 128 bytes of memory, the SPSCQueue structure needed to be aligned to 128 bytes, not 64. When I align the structure to 128, each configuration gives the same benchmark results. If somebody has exact knowledge of why aligning the Queue to 128 bytes rather than 64 would matter when the cache-line size is 64 bytes, I would really like to know. – Extreme
Intel's L2 spatial prefetcher tries to complete a 128-byte aligned pair of cache lines after loading one or the other of the pair. – Buffy
Write a self answer. – Hagberry
When you apply alignas(64) to members, char _pad is unnecessary; the compiler inserts padding for you. – Louannlouanna
@MaximEgorushkin Yes, you are right; that's why I was very confused when I saw that it would give different results, but it was only due to structure alignment, not member alignment. – Extreme

Thanks to your useful comments (and mostly, thanks to Peter Cordes), it seems that the issue was coming from the L2 data prefetcher.

Because of my SPSC queue design, each thread must access two consecutive cache lines to push to / pop from the queue. If the structure itself is not aligned to 128 bytes, the two cache lines a thread touches can straddle a 128-byte aligned pair, and the L2 spatial prefetcher, which completes such pairs, then pulls each thread's lines into the other core's cache, reintroducing false sharing.

Thus, the simple fix is:

template<typename Type>
class alignas(128) SPSCQueue { ... };
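It's also worth checking that heap allocations respect the extended alignment; since C++17, operator new honours over-aligned types, so a new-ed queue gets the 128-byte alignment too. A quick sketch, with a hypothetical Padded stand-in for the queue class:

```cpp
#include <cstdint>
#include <new>

// Hypothetical stand-in for the 128-byte-aligned queue class.
struct alignas(128) Padded { char data[256]; };

// Since C++17, operator new honours extended ("over-aligned") alignment
// requirements, so a heap-allocated instance is 128-byte aligned as well.
inline bool heapAligned()
{
    auto *p = new Padded;
    const bool ok = reinterpret_cast<std::uintptr_t>(p) % 128 == 0;
    delete p;
    return ok;
}
```

Before C++17 (or with a custom allocator such as the malloc used in the queue's constructor), the alignment would have to be enforced manually, e.g. with std::aligned_alloc.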

Section 2.5.5.4 ("Data Prefetching") of Intel's Optimization Reference Manual is an interesting read from Intel explaining optimization on their architectures and how prefetching is done at the various levels of cache.
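As a side note, a more portable way to express the cache-line constant is C++17's std::hardware_destructive_interference_size from <new>. A sketch (GCC only ships the constant from version 12 onwards, hence the feature-test guard, and doubling it to defeat the adjacent-line prefetcher is my own choice here, not something the standard prescribes):

```cpp
#include <cstddef>
#include <new>

// Portable cache-line size: fall back to 64 where the standard library
// does not provide std::hardware_destructive_interference_size.
#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t CacheLineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t CacheLineSize = 64;
#endif

// Separate producer and consumer data by two cache lines so the
// L2 adjacent-line prefetcher cannot couple them.
constexpr std::size_t FalseSharingRange = 2 * CacheLineSize;

static_assert(FalseSharingRange % CacheLineSize == 0);
```

On x86-64 this yields 128 bytes either way, matching the alignas(128) fix above.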

Extreme answered 23/9, 2020 at 13:39 Comment(0)
