Fast and Lock-Free Single-Writer, Multiple-Reader
I've got a single writer which has to increment a variable at a fairly high frequency, and one or more readers who access this variable at a lower frequency.

The write is triggered by an external interrupt.

Since I need to write at high speed, I don't want to use mutexes or other expensive locking mechanisms.

The approach I came up with is copying the value after writing to it. The reader can then compare the original with the copy. If they are equal, the variable's content is valid.

Here is my implementation in C++:

template<typename T>
class SafeValue
{
private:
    volatile T _value;
    volatile T _valueCheck;
public:
    void setValue(T newValue)
    {
        _value = newValue;
        _valueCheck = _value;
    }

    T getValue()
    {
        volatile T value;
        volatile T valueCheck;
        do
        {
            valueCheck = _valueCheck;
            value = _value;
        } while(value != valueCheck);

        return value;
    }
};

The idea behind this is to detect data races while reading and retry if they happen. However, I don't know if this will always work. I haven't found anything about this approach online, hence my question:

Is there any problem with my approach when used with a single writer and multiple readers?

I already know that high writing frequencies may cause starvation of the reader. Are there more bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?

Edit 1:

My target system is an ARM Cortex-A15.

T should be able to be at least any primitive integral type.

Edit 2:

std::atomic is too slow on both the reader and the writer side. I benchmarked it on my system: writes are roughly 30 times slower and reads roughly 50 times slower than unprotected primitive operations.

Thespian answered 10/1, 2019 at 9:51 Comment(9)
Comments are not for extended discussion; this conversation has been moved to chat.Tapioca
Do you really need to write at high speed? Couldn't you increment a writer-private variable, and then just add it to the reader-accessible total periodically, as opposed to having to increment the total by 1 frequently? In other words, can the readers tolerate a little bit of staleness?Spence
@Branko Dimitrijevic - Not much. The counter is used to track the position of a stepper motor. The reader needs to get this motor's position as exactly as possible.Thespian
@Thespian I would recommend editing the question to explain in detail why atomics are not suitable for you. As you write in the comment below: The problem with std::atomic is that internal system calls are greatly slowing it down. What internal system calls? Atomic operations should boil down to special instructions only if atomics are supported by hardware for a given data type (is that your case or not?). What do you mean by slowing it down? Slowing the producer? Slowing the consumers? How do you measure such slowing?Bearable
@Daniel Langr - I benchmarked it on the target system. Edited my question.Thespian
@Thespian Thanks. I am still not sure whether I understand the whole problem, since when I compare the generated assembly of your solution with one that uses std::atomic<T> (for int type, ARM, and GCC 8.2), the latter one is much simpler and should be more efficient: godbolt.org/z/lpvFQB. BTW, how do you measure the slowdown?Bearable
Just out of curiosity, I also tried to substitute the volatiles in your solution with atomics: godbolt.org/z/tl4wdB. The generated assembly is very similar and I doubt there would be any significant performance difference (note that there is no reason for those local variables in getValue to be volatile).Bearable
You benchmarked std::atomic with what ordering constraint? The default sequential consistency? Did you profile with release/acquire or release/consume ordering? Did you look at the generated assembly for those cases?Darcee
What you are doing is very similar to what I have seen called "sequence lock". Here is another thread with some details. https://mcmap.net/q/18904/-is-volatile-required-hereCompliancy
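A sketch of the sequence-lock idea the last comment mentions, for reference (the class name and the two-word payload are illustrative, not from the thread; the fence placement follows the usual single-writer seqlock recipe):

```cpp
#include <atomic>
#include <cstdint>

// Minimal single-writer sequence lock ("seqlock"). The writer makes the
// sequence counter odd before touching the payload and even again afterwards;
// a reader retries whenever it saw an odd counter or the counter changed
// underneath it. The payload words are relaxed atomics so the speculative
// read in load() is not a formal data race.
class SeqLockCounter {
    std::atomic<uint32_t> seq_{0};
    std::atomic<uint32_t> lo_{0};   // two words stand in for a 64-bit payload
    std::atomic<uint32_t> hi_{0};

public:
    void store(uint64_t v) {                              // single writer only
        uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);     // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        lo_.store(uint32_t(v),       std::memory_order_relaxed);
        hi_.store(uint32_t(v >> 32), std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);     // even: write done
    }

    uint64_t load() const {
        for (;;) {
            uint32_t s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1) continue;                         // write in progress, retry
            uint64_t lo = lo_.load(std::memory_order_relaxed);
            uint64_t hi = hi_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == s1)
                return (hi << 32) | lo;                   // consistent snapshot
        }
    }
};
```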

You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting Cortex-A15 (an ARMv7-A CPU), make sure to use -march=armv7-a or even -mcpu=cortex-a15.

The former will generate an ldrexd instruction, which should be atomic according to the ARM docs:

Single-copy atomicity

In ARMv7, the single-copy atomic processor accesses are:

  • all byte accesses
  • all halfword accesses to halfword-aligned locations
  • all word accesses to word-aligned locations
  • memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.

The latter will generate an ldrd instruction, which should be atomic on targets supporting the Large Physical Address Extension:

In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.

--- Note ---

The Large Physical Address Extension adds this requirement to avoid the need for complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.

You can also check how the Linux kernel implements those:

#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("@ atomic64_read\n"
"   ldrd    %0, %H0, [%1]"
    : "=&r" (result)
    : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("@ atomic64_read\n"
"   ldrexd  %0, %H0, [%1]"
    : "=&r" (result)
    : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#endif
Skirret answered 10/1, 2019 at 16:15 Comment(0)

If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.

Lurdan answered 10/1, 2019 at 10:2 Comment(11)
std::atomic is too slow, even when lock-free, which is not always guaranteed.Thespian
@Thespian Your claims make no sense. Study the assembly generated by std::atomic.Purchasable
@MaximEgorushkin Not really if you read the question. He just wants to avoid the CAS-loop. The problem he is trying to solve CAN be solved faster than std::atomic as his task is much more specific.Skirret
@Ivan There is one writer only, the shared variable is int, no CAS loop is necessary.Purchasable
@MaximEgorushkin Reading 8-byte values on 32-bit platforms is implemented using a CAS loop, which will fail on read-only pages, for example. I think he wants to speed up the reader side. The problem is that his code is not "lock-free" at all.Skirret
@Ivan Where did you pull numbers 8 and 32 from?Purchasable
@MaximEgorushkin From his comments. Although there is a lengthy 30-comment discussion now. He needs this to work for 8 byte values on 32-bit ARM.Skirret
@Ivan: The OP's Cortex-A15 (ARMv7) doesn't have CAS in hardware, it has LL/SC. And it can do an atomic 64-bit load or store using LDREXD, or LDREXD+STREXD. Are you maybe thinking of 32-bit x86 with cmpxchg8b?Cumulostratus
@PeterCordes GCC generates CAS (__sync_val_compare_and_swap_8 call) to read 8 byte value for 32-bit ARM - link. I think you will need both LDREXD and STREXD to implement it. Similar is done for x86, but it will be 8-byte cmpxchg instruction, yes.Skirret
@Ivan: You forgot to tell the compiler it was targeting an ARMv7-a CPU, specifically -mcpu=cortex-a15. godbolt.org/z/q-S5x1. In your link, gcc was compiling for any generic ARM, and had to emit code that worked on anything, including ARMv5 and earlier before LDREX / STREX even existed, and before ldrd was guaranteed atomic anywhere, so it couldn't inline them.Cumulostratus
@PeterCordes Indeed, I have checked the ARM specs on that. Surprising that such a slow implementation is used by default. Linux doesn't use CAS at all when implementing 64-bit accesses.Skirret

There's no way anyone can know. You would have to see whether your compiler documents any multi-threaded semantics that would guarantee this works, or look at the generated assembler code and convince yourself that it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, different optimization options, or a newer CPU might break the code.

I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.

Luigiluigino answered 10/1, 2019 at 11:18 Comment(2)
I mainly wanted to focus on the algorithm rather than language-specific details. And how would using inline assembly provide a viable performance increase? The problem with std::atomic is that internal system calls are greatly slowing it down.Thespian
@Thespian The algorithm is so simple here that it's all down to the specifics of implementation. Using inline assembly would provide a viable performance increase because it would allow you to be 100% sure you were getting what you were asking for so you wouldn't need to code defensively. Defensive coding will likely cost you.Luigiluigino

Another option is to have a buffer of non-atomic values the publisher writes into and an atomic pointer to the latest one.

#include <atomic>
#include <utility>

template<class T>
class PublisherValue {
    static auto constexpr N = 32;
    T values_[N];
    std::atomic<T*> current_{values_};

public:
    PublisherValue() = default;
    PublisherValue(PublisherValue const&) = delete;
    PublisherValue& operator=(PublisherValue const&) = delete;

    // Single writer thread only.
    template<class U>
    void store(U&& value) {
        T* p = current_.load(std::memory_order_relaxed);
        if(++p == values_ + N)
            p = values_;
        *p = std::forward<U>(value);
        current_.store(p, std::memory_order_release); // (1) 
    }

    // Multiple readers. Make a copy to avoid referring to the value for too long.
    T load() const {
        return *current_.load(std::memory_order_consume); // Sync with (1).
    }
};

This is wait-free, but there is a small chance that a reader is de-scheduled while copying the value and hence reads the oldest slot while it is being overwritten. Making N bigger reduces this risk.

Purchasable answered 10/1, 2019 at 11:3 Comment(0)
