Should volatile still be used for sharing data with ISRs in modern C++?

I've seen some flavors of this question around and I've seen mixed answers, so I'm still unsure whether they are up-to-date and fully apply to my use case. Do let me know if it's a duplicate!

Given that I'm developing for STM32 microcontrollers (bare-metal) using C++17 and the gcc-arm-none-eabi-9 toolchain:

Do I still need to use volatile for sharing data between an ISR and main()?

volatile std::int32_t flag = 0;

extern "C" void ISR()
{
    flag = 1;
}

int main()
{
    while (!flag) { ... }
}

It's clear to me that I should always use volatile for accessing memory-mapped HW registers.

However for the ISR use case I don't know if it can be considered a case of "multithreading" or not. In that case, people recommend using C++11's new threading features (e.g. std::atomic). I'm aware of the difference between volatile (don't optimize) and atomic (safe access), so the answers suggesting std::atomic confuse me here.

For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile.

In other words: can the compiler know that flag can change inside ISR? If not, how can it know it in regular multithreaded applications?

Thanks!

Lisabeth asked 18/8, 2020 at 15:0 Comment(9)
You have to use volatile to tell the compiler that, inside main, flag might get changed without the compiler noticing. std::atomic is also fine, but in this case it's not really needed.Fairy
@HS2: When using clang/gcc, if one doesn't use either atomic or a clang/gcc "__asm" intrinsic, operations on the volatile data-ready flag might get reordered with respect to operations on the buffer the flag was being used to guard.Tripetalous
@Tripetalous That’s right and reordering is not covered by the standard, only sequential consistency. But if I’m not wrong that wasn’t the original question.Fairy
@Tripetalous And yes, when it comes to, say, semaphore/mutex semantics, potential reordering, speculative execution and prefetching have to be taken into account.Fairy
@HS2: On a single core system, when using a compiler that treats volatile as a global barrier to compiler reordering, volatile will work reliably for coordinating actions with ISRs. When using clang and gcc, volatile semantics are too weak to be suitable for that purpose without also using memory-clobber intrinsics.Tripetalous
There is also the standard sig_atomic_t which is the (possibly volatile-qualified) integer type of an object that can be accessed as an atomic entity, even in the presence of asynchronous interrupts.Smashed
@KamilCuk: That tends to be of somewhat limited usefulness, since most implementations can offer semantic guarantees that are stronger than what the Standard requires, and many tasks would be impractical, if not outright impossible, without such guarantees.Tripetalous
For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile. Huh? Your code with a stop_running flag is a textbook example of code that breaks with -O2 when the flag-setting is done from another thread. See Multithreading program stuck in optimized mode but runs normally in -O0, and MCU programming - C++ O2 optimization breaks while loop. You need std::atomic<bool> (optionally with std::memory_order_relaxed), or for signal/interrupt handlers you can weaken that to volatile sig_atomic_t.Clyde
Yes, I meant that I can use atomic instead of volatile (which I don't need anywhere in x86 user-level programming, as opposed to bare-metal programming)Lisabeth

I think that in this case both volatile and atomic will most likely work in practice on 32-bit ARM. At least in an older version of the STM32 tools I saw that the C atomics were in fact implemented using volatile for small types.

Volatile will work because the compiler may not optimize away any access to the variable that appears in the code.

However, the picture changes for types that cannot be loaded in a single instruction. If you use a volatile int64_t, the compiler will happily load it in two separate instructions. If the ISR runs between the two loads, you will observe half the old value and half the new value.
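As a minimal illustration of that tearing hazard (the variable name is invented): on a 32-bit core the single read below compiles to two separate loads, and an ISR that writes the variable can fire between them.

#include <cstdint>

volatile std::int64_t wide;  // 64 bits: cannot be read in one instruction on Cortex-M

std::int64_t readWide()
{
    return wide;  // two 32-bit loads; an ISR in between yields a torn value
}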

Unfortunately, using atomic<int64_t> may also fail with interrupt service routines if the implementation is not lock-free. For Cortex-M, 64-bit accesses are not necessarily lock-free, so atomic should not be relied on without checking the implementation. Depending on the implementation, the system might deadlock if the locking mechanism is not reentrant and the interrupt happens while the lock is held. Since C++17, this can be queried by checking atomic<T>::is_always_lock_free. A specific answer for a specific atomic variable (this may depend on alignment) can be obtained by checking flagA.is_lock_free(), available since C++11.
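A short sketch of both checks (the names are illustrative):

#include <atomic>
#include <cstdint>

// Compile-time check (C++17): fail the build if a lock would be needed.
static_assert(std::atomic<std::int32_t>::is_always_lock_free,
              "32-bit atomics must be lock-free for ISR use");

std::atomic<std::int64_t> wideFlag;

bool wideIsSafe()
{
    // Run-time check (C++11) for this particular object;
    // the answer may depend on its alignment.
    return wideFlag.is_lock_free();
}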

So longer data must be protected by a separate mechanism, for example by turning off interrupts around the access and making the variable atomic or volatile.
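A minimal sketch of the interrupt-masking approach, assuming the CMSIS-Core intrinsics __get_PRIMASK(), __disable_irq() and __set_PRIMASK() are provided by the device header (they are CMSIS extensions, not standard C++):

#include <cstdint>

volatile std::int64_t sharedValue;  // too wide for a single load/store

std::int64_t readShared()
{
    std::uint32_t primask = __get_PRIMASK();  // remember the current mask state
    __disable_irq();                          // no ISR can run between the two loads
    std::int64_t copy = sharedValue;
    __set_PRIMASK(primask);                   // restore instead of blindly re-enabling
    return copy;
}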

So the correct way is to use std::atomic, as long as the access is lock-free. If you are concerned about performance, it may pay off to select the appropriate memory order and to stick to values that can be loaded in a single instruction.

Not using either would be wrong: the compiler will check the flag only once.

These functions all wait for a flag, but they get translated differently:

#include <atomic>
#include <cstdint>

using FlagT = std::int32_t;

volatile FlagT flag = 0;
void waitV()
{
    while (!flag) {}
}

std::atomic<FlagT> flagA;
void waitA()
{
    while(!flagA) {}    
}

void waitRelaxed()
{
    while(!flagA.load(std::memory_order_relaxed)) {}    
}

FlagT wrongFlag;
void waitWrong()
{
    while(!wrongFlag) {}
}

Using volatile you get a loop that reexamines the flag as you wanted:

waitV():
        ldr     r2, .L5
.L2:
        ldr     r3, [r2]
        cmp     r3, #0
        beq     .L2
        bx      lr
.L5:
        .word   .LANCHOR0

Atomic with the default sequentially consistent access produces synchronized access:

waitA():
        push    {r4, lr}
.L8:
        bl      __sync_synchronize
        ldr     r3, .L11
        ldr     r4, [r3, #4]
        bl      __sync_synchronize
        cmp     r4, #0
        beq     .L8
        pop     {r4}
        pop     {r0}
        bx      r0
.L11:
        .word   .LANCHOR0

If you do not care about the memory order you get a working loop just as with volatile:

waitRelaxed():
        ldr     r2, .L17
.L14:
        ldr     r3, [r2, #4]
        cmp     r3, #0
        beq     .L14
        bx      lr
.L17:
        .word   .LANCHOR0

Using neither volatile nor atomic will bite you with optimization enabled, as the flag is only checked once:

waitWrong():
        ldr     r3, .L24
        ldr     r3, [r3, #8]
        cmp     r3, #0
        bne     .L23
.L22:                        // infinite loop!
        b       .L22
.L23:
        bx      lr
.L24:
        .word   .LANCHOR0
flag:
flagA:
wrongFlag:
Stationary answered 18/8, 2020 at 16:24 Comment(18)
Interesting answer, but can the Godbolt gcc ARM compiler be trusted to generate the same code as gcc "ARM none EABI" used by STM32 bare metal tool chains?Marleen
You can and should always run your-gcc -S to see the actual assembly output, or disassemble with objdump. Note also that your compilation for STM32 probably contains a significant number of additional flags; I just added what I could remember on the spot. The point is, with atomic the compiler must make sure that concurrent access works, with volatile the guarantee is different.Stationary
If a platform has no natural way of handling 64-bit operations atomically, an implementation's "atomic" features are unlikely to work reliably in conjunction with interrupts unless they can save the interrupt state, disable interrupts, perform the operation, and restore the interrupt state. If temporarily disabling interrupts would be acceptable, user code should be able to do that without need for an implementation's "atomic" features, and use the resulting semantics to do various things more easily than would be possible with "atomic".Tripetalous
I agree that atomic is the way to go, but I see your argument about volatile int64_t as flawed - you cannot use atomic<int64_t> either (if its is_lock_free() is false). That would either use a mutex (blocking the IRQ/ISR indefinitely) or LL/SC (which is a bad idea in an IRQ because LL/SC typically cannot be nested and breaks the logic if you do).Expiatory
@firda: I don't think the Standard makes clear whether atomics that use ll/sc of the target type are supposed to indicate is_lock_free(). It's generally not possible for a compiler to guarantee that such operations will be technically lock free, but in practice they can often be guaranteed to make progress if a system ever manages to execute more than a few instructions between interrupts. For many purposes what's more important are that operations be obstruction free, and that they use the same locking mechanism as anything else on the system that needs to be atomic.Tripetalous
@supercat: LL/SC (LDREX/STREX) is a spinlock; that is not lock-free. You either use atomic_flag, which is the only thing guaranteed to work in an ISR, or you need to make it platform-specific. There I bet on atomic_int when needed, because volatile may not be enough and a memory clobber may not be enough (it may need DMB or DSB instructions), so atomic either does it right or it is simply not possible at all. (And you can add some static_assert or use ATOMIC_INT_LOCK_FREE.)Expiatory
@firda: In many systems, the circumstances necessary to cause LL/SC to live-lock could never occur, though an implementation may have no way of knowing that. What's needed is a way for someone who knows the semantics of the underlying platform to have a consistent compiler-independent way of indicating those in the language--something for which C used to be good but has gotten progressively worse as compiler writers have lost sight of the fact that what made C useful was not the anemic abstraction model of the standard, but that the language could adapt to many abstraction models.Tripetalous
@firda: If one can't use is_lock_free to determine whether an implementation claims to use a platform's native semantics for atomic operations, what means should one use? Whether one needs a DMB or DSB depends upon the core and whether one is interacting with interrupts or with things like DMA that can alter memory without the core's involvement. Programmers will often know such things when compilers can't.Tripetalous
@supercat: read this en.cppreference.com/w/cpp/atomic/atomic/is_always_lock_free and this en.cppreference.com/w/c/atomic/ATOMIC_LOCK_FREE_consts Practically you either find a lock-free solution or you have to disable interrupts. (And about your "In many systems, ...": not true for the STM32 in question; you must use STREX with the same address as the last LDREX or you break the contract = UB = never do that in an ISR.)Expiatory
@firda: On the Cortex-M3, if an interrupt context switch occurs between an LDREX and STREX, it is guaranteed to invalidate the pending LDREX, so a subsequent STREX will report failure. If the time between an LDREX and STREX is sufficiently long that an interrupt will always occur between them, the STREX will never succeed, but if there ever will be a long enough time without interrupts, the LDREX/STREX loop will run until then.Tripetalous
@supercat: static.docs.arm.com/dui0553/a/DUI0553A_cortex_m4_dgug.pdf - page 83: The result of executing a Store-Exclusive instruction to an address that is different from that used in the preceding Load-Exclusive instruction is unpredictable.Expiatory
@supercat: P.S.: I see no real reason why the HW would not remember the last address used and make STREX fail if used with a different one, but that document states otherwise. I see no way to even implement thread-switching correctly if STREX were so broken. I would love it to work properly, but I've seen HW not doing what you would expect way too often. Anyway, if you have a better document, please share. Otherwise we should either move to chat or leave this topic open.Expiatory
See developer.arm.com/documentation/dui0552/a/… for information about ldrex/strex. Note in particular that processing an exception (interrupt) clears the exclusive-access flag, so on a Cortex-M3 the basic effect of "strex" is "perform the store unless an interrupt has occurred since the ldrex". BTW, I find myself curious why strex doesn't set flags, since code is almost certainly going to be interested in branching on whether it succeeded or failed.Tripetalous
Very interesting discussion, I now realize I have a whole lot more to learn on the topic! Thanks a lot :) I greatly appreciate the example and the methodology - inspect the assembly to be 100% sure. I wanted to know mostly if modern C++ compilers in 2020 would have already figured this out, but it turns out they haven't (perhaps they never will?). For thread-safety I'll probably go for disable/enable interrupts for now, since I actually want to read arrays instead of 32-bit flags. @Marleen Godbolt does support arm-none-eabi :) godbolt.org/z/hdxz4bLisabeth
@firda: There are a couple approaches a system can use for implementing something like LDREX/STREX: watch the address and make the STREX fail if anything happens to it, or else watch for anything "suspicious" happening and make the STREX fail if it does. The latter approach is simpler and easier to implement, but would work extremely poorly, if not unusably, in a multi-core system. A difference I don't remember whether the Cortex documentation mentioned is that when using the former approach, something like ...Tripetalous
ldrex r1,[r0] / str r1,[r2] / strex r2,r1,[r0] would result in the strex reporting failure if r0 and r2 are equal (because of the store to the r0/r2 address between the ldrex and strex) but when using the latter approach the strex would likely overwrite the value written by the str (unless an interrupt happened to occur between the ldrex/strex).Tripetalous
@firda: Of course, that leaves open the question of whether any/all versions of clang/gcc would refrain from other memory operations across ldrex and strex. If e.g. code 'ldrex'es a list head pointer, stores it into a new list item's "next" pointer, and then 'strex'es the list head pointer to the new item, having a compiler defer the update of the list item's "next" pointer past the strex could result in a wrong "next" pointer being read from the new item.Tripetalous
@supercat: Was searching a bit more and: 1. I can confirm that clrex is auto-executed when interrupted (making a following strex fail, which makes task-switching possible), but 2. any memory access between the two can lead to problems and unexpected/undefined behaviour (Exclusives Reservation Granule), leaving only one reliable usage for these - spinlocks (CAS/RMW). And that leads us back to the ISR deadlock (mutex-lock in ISR). So again: either lock-free atomics (atomic_flag especially) or disabling interrupts. Nothing else is reliable (in general; vendors can give better guarantees).Expiatory

To understand the issue, you must first understand why volatile is needed in the first place.

There are three completely separate issues here:

  1. Incorrect optimizations because the compiler doesn't realize that hardware callbacks such as ISRs are actually called.

    Solution: volatile or compiler awareness.

  2. Re-entrancy and race condition bugs caused by accessing a variable in several instructions and getting interrupted in the middle of it by an ISR using the same variable.

    Solution: protected or atomic access with mutex, _Atomic, disabled interrupts etc.

  3. Parallelism or pre-fetch cache bugs caused by instruction re-ordering, multi-core execution, branch prediction.

    Solution: memory barriers or allocation/execution in memory areas that aren't cached. volatile access may or may not act as a memory barrier on some systems.

As soon as someone brings this kind of question up on SO, you always get lots of PC programmers babbling about 2 and 3 without knowing or understanding anything about 1. This is because they have never in their life written an ISR, and PC compilers with multi-threading are generally aware that thread callbacks will get executed, so this isn't typically an issue in PC programs.

What you need to do to solve 1) in your case is to see whether the compiler actually generates code for the read in while (!flag), with and without optimizations enabled. Disassemble and check.

Ideally, the compiler documentation will state that the compiler understands the meaning of some compiler-specific extension, such as the non-standard keyword interrupt, and upon spotting it makes no assumptions about that function not getting called.

Sadly though, most compilers only use interrupt etc. keywords to generate the right calling convention and return instructions. I encountered the missing-volatile bug just a few weeks ago, while helping someone on an SE site, and they were using a modern ARM tool chain. So I don't trust compilers to handle this, still in the year 2020, unless they explicitly document it. When in doubt, use volatile.

Regarding 2) and re-entrancy, modern compilers do support _Atomic nowadays, which makes things very easy. Use it if it's available and reliable on your compiler. Otherwise, for most bare-metal systems you can utilize the fact that interrupts are non-interruptible and use a plain bool as a "mutex lite" (example), as long as there is no instruction re-ordering (an unlikely case for most MCUs).
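A hedged sketch of such a "mutex lite" on a single-core MCU (the names are invented; it relies on the ISR not being pre-emptible and on no instruction re-ordering, as stated above):

#include <cstdint>

volatile bool rx_ready = false;  // the "mutex lite" flag
static std::uint8_t rx_buf[64];  // ISR writes only while rx_ready is false

extern "C" void UART_ISR()       // hypothetical vector name
{
    if (!rx_ready) {             // main() is not reading right now
        // ... fill rx_buf from the UART ...
        rx_ready = true;         // publish the buffer to main()
    }
}

int main()
{
    for (;;) {
        if (rx_ready) {          // ISR won't touch rx_buf while this is true
            // ... consume rx_buf ...
            rx_ready = false;    // hand the buffer back to the ISR
        }
    }
}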

But please note that 2) is a separate issue not related to volatile. volatile does not solve thread-safe access. Thread-safe access does not solve incorrect optimizations. So don't mix these two unrelated concepts up in the same mess, as often seen on SO.

Marleen answered 19/8, 2020 at 8:19 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Garvy

Of the commercial compilers I've tested that weren't based on gcc or clang, all of them would treat a read or write via volatile pointer or lvalue as being capable of accessing any other object, without regard for whether it would seem possible for the pointer or lvalue to hit the object in question. Some, such as MSVC, formally documented the fact that volatile writes have release semantics and volatile reads have acquire semantics, while others would require a read/write pair to achieve acquire semantics.

Such semantics make it possible to use volatile objects to build a mutex that can guard "ordinary" objects on systems with a strong memory model (including single-core systems with interrupts), or on compilers that apply acquire/release barriers at the hardware memory ordering level rather than merely the compiler ordering level.

Neither clang nor gcc, however, offers any option other than -O0 that would provide such semantics, since doing so would impede "optimizations" that can convert code performing seemingly-redundant loads and stores [that are actually needed for correct operation] into "more efficient" code [that doesn't work]. To make one's code usable with those compilers, I would recommend defining a "memory clobber" macro (which for clang or gcc would be asm volatile ("" ::: "memory");) and invoking it between the action that needs to precede a volatile write and the write itself, or between a volatile read and the first action that needs to follow it. If one does that, the code can readily be adapted to implementations that would neither support nor require such barriers, simply by defining the macro as an empty expansion.
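A sketch of such a macro and its placement around a volatile hand-off (the names are illustrative; on compilers whose volatile already orders ordinary accesses, the macro would expand to nothing):

// Compiler-only barrier for clang/gcc; define as empty elsewhere.
#define MEMORY_CLOBBER() asm volatile ("" ::: "memory")

static short buf[10];
volatile bool data_ready = false;

void publish(short v)
{
    buf[0] = v;        // ordinary store that must precede the flag write
    MEMORY_CLOBBER();  // keep the compiler from sinking buf[0]=v past the flag
    data_ready = true; // volatile write: the consumer may now read buf
}

short consume()
{
    while (!data_ready) {}
    MEMORY_CLOBBER();  // keep the compiler from hoisting the read of buf
    return buf[0];
}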

Note that while some compilers interpret all asm directives as a memory clobber, and there wouldn't be any other purpose for an empty asm directive, gcc simply ignores empty asm directives rather than interpreting them in such fashion.

An example of a situation where gcc's optimizations would prove problematic (clang seems to handle this particular case correctly, but some others still pose problems):

short buffer[10];
short volatile *volatile tx_ptr;  // the pointer itself must be volatile, too
volatile int tx_count;
void test(void)
{
    buffer[0] = 1;    // ordinary store the volatile hand-off is meant to publish
    tx_ptr = buffer;  // volatile store: hand the buffer to the "transmitter"
    tx_count = 1;
    while(tx_count)
        ;
    buffer[0] = 2;
    tx_ptr = buffer;
    tx_count = 1;
    while(tx_count)
        ;
}

GCC will decide to optimize out the assignment buffer[0]=1; because the Standard doesn't require it to recognize that storing the buffer's address into a volatile might have side effects that would interact with the value stored there.

[edit: further experimentation shows that icc will reorder accesses to volatile objects, but since it reorders them even with respect to each other, I'm not sure what to make of that; it would seem broken by any imaginable interpretation of the Standard].

Tripetalous answered 18/8, 2020 at 16:49 Comment(0)

Short answer: always use std::atomic<T> whose is_lock_free() returns true.

Reasoning:

  1. volatile can work reliably on simple architectures (single-core, no cache, ARM/Cortex-M) like the STM32F2 or ATSAMG55 with e.g. the IAR compiler. But...
  2. It may fail to work as expected on more complex architectures (multi-core with cache) and when the compiler tries to do certain optimisations (many examples in the other answers, I won't repeat them).
  3. atomic_flag and atomic_int (if is_lock_free(), which they should be) are safe to use anywhere, because they work like volatile with added memory barriers/synchronization where needed (avoiding the problems in the previous point).
  4. The reason I specifically said you must only use those whose is_lock_free() is true is that you cannot stop an IRQ the way you can stop a thread. No, an IRQ interrupts the main loop and does its job; it cannot wait-lock on a mutex, because it would be blocking the very main loop it is waiting for.

Practical note: I personally either use atomic_flag (the one and only thing guaranteed to work) to implement a sort of spin-lock, where the ISR will disable itself when it finds the lock taken, while the main loop will always re-enable the ISR after unlocking. Or I use a double-counter lock-free queue (SPSC - single producer, single consumer) built on that atomic_int: have one reader counter and one writer counter, and subtract to find the real count. Good for UART etc.
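A hedged sketch of such a double-counter SPSC queue (illustrative names; assumes 32-bit lock-free atomics and a power-of-two size so the unsigned counters may wrap freely):

#include <atomic>
#include <cstdint>

template <std::uint32_t N>  // N must be a power of two
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    static_assert(std::atomic<std::uint32_t>::is_always_lock_free,
                  "counters must be lock-free for ISR use");
    std::uint8_t buf[N];
    std::atomic<std::uint32_t> written{0};  // advanced only by the ISR
    std::atomic<std::uint32_t> read{0};     // advanced only by main()
public:
    bool push(std::uint8_t b)  // producer side (the ISR)
    {
        std::uint32_t w = written.load(std::memory_order_relaxed);
        if (w - read.load(std::memory_order_acquire) == N)
            return false;  // full
        buf[w % N] = b;
        written.store(w + 1, std::memory_order_release);  // publish the byte
        return true;
    }
    bool pop(std::uint8_t& b)  // consumer side (the main loop)
    {
        std::uint32_t r = read.load(std::memory_order_relaxed);
        if (written.load(std::memory_order_acquire) == r)
            return false;  // empty
        b = buf[r % N];
        read.store(r + 1, std::memory_order_release);  // free the slot
        return true;
    }
};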

Expiatory answered 20/8, 2020 at 13:37 Comment(0)
