Does the C++ volatile keyword introduce a memory fence?

I understand that volatile informs the compiler that the value may be changed, but in order to accomplish this functionality, does the compiler need to introduce a memory fence to make it work?

From my understanding, the sequence of operations on volatile objects cannot be reordered and must be preserved. This seems to imply some memory fences are necessary and that there isn't really a way around this. Am I correct in saying this?


There is an interesting discussion at this related question.

Jonathan Wakely writes:

... Accesses to distinct volatile variables cannot be reordered by the compiler as long as they occur in separate full expressions ... right that volatile is useless for thread-safety, but not for the reasons he gives. It's not because the compiler might reorder accesses to volatile objects, but because the CPU might reorder them. Atomic operations and memory barriers prevent the compiler and the CPU from reordering

To which David Schwartz replies in the comments:

... There's no difference, from the point of view of the C++ standard, between the compiler doing something and the compiler emitting instructions that cause the hardware to do something. If the CPU may reorder accesses to volatiles, then the standard doesn't require that their order be preserved. ...

... The C++ standard doesn't make any distinction about what does the reordering. And you can't argue that the CPU can reorder them with no observable effect so that's okay -- the C++ standard defines their order as observable. A compiler is compliant with the C++ standard on a platform if it generates code that makes the platform do what the standard requires. If the standard requires accesses to volatiles not be reordered, then a platform that reorders them isn't compliant. ...

My point is that if the C++ standard prohibits the compiler from reordering accesses to distinct volatiles, on the theory that the order of such accesses is part of the program's observable behavior, then it also requires the compiler to emit code that prohibits the CPU from doing so. The standard does not differentiate between what the compiler does and what the compiler's generated code makes the CPU do.

This yields two questions: is either of them "right"? And what do actual implementations really do?

Teage answered 10/10, 2014 at 19:51 Comment(18)
It mostly means that the compiler should not keep that variable in a register. Every assignment and read in the source code should correspond to memory accesses in the binary code.Rives
stackoverflow.com/questions/14785639/…Outgoings
I suspect the point is that any memory fence would be ineffective if the value were to be stored in an internal register. I think you still need to take other protective measures in a concurrent situation.Gilt
As far as I know, volatile is used for variables which can be changed by hardware (often used with microcontrollers). It simply means reading the variable can't be done in a different order and can't be optimized away. That's C though, but it should be the same in C++.Mcmahan
@Mcmahan I've yet to see a compiler that prevents reads of volatile variables from being optimized away by the CPU caches. Either all these compilers are non-conformant or the standard doesn't mean what you think it means. (The standard does not distinguish between what the compiler does and what the compiler makes the CPU do. It's the compiler's job to emit code that, when run, complies with the standard.)Lyre
@DavidSchwartz Optimized away from what/where? The RAM? Memory mapped IO?Subinfeudation
@Subinfeudation My point is simply that volatile doesn't provide any ordering guarantees and it doesn't prevent reads from being optimized away from any place the platform can optimize them away and still keep volatile useful for those uses the standard requires it to be useful for.Lyre
@DavidSchwartz So a reasonable implementation is not expected to guarantee the ptrace semantics of volatile, that is, when a breakpoint is inserted, all volatile variables can be examined and changed with ptrace and if the execution is restarted, all variables truly hold their new values and the behavior of the program is well defined? (and any volatile access is a well defined position in execution, a "well defined position" is one where an exact breakpoint can be set, an exact breakpoint is one that stops exactly those executions that pass at that C/C++ instruction)Subinfeudation
@Subinfeudation I think that's an unreasonable expectation, at least without specific compiler flags. That might require disabling very significant optimizations that 99.9% of code can benefit from. And, of course, the C++ standard doesn't require it. (Consider that when any function is called, the compiler usually doesn't know if that function does or doesn't contain any volatile accesses. So optimizations might have to be disabled even for code that never uses anything volatile.)Lyre
@DavidSchwartz Which optimisation would be prevented by allowing ptrace PEEK/POKE on volatile variables? (and only on volatile variables) Which real compiler doesn't already and has always implemented "ptrace volatile semantic" as I described? How could allowing breakpoints at the point of a volatile access and allowing arbitrary changes to a volatile variable on a stopped thread, affect programs that don't use volatile? A signal handler can already modify any volatile variable on existing compilers AFAIK.Subinfeudation
@DavidSchwartz I think that GCC implements something stronger than "ptrace semantics", as volatile automatic objects are stored in memory. "ptrace" allows volatiles to be in registers as long as no knowledge about the value of those variables is kept.Subinfeudation
@Subinfeudation What current implementations happen to do is not something you should base supposedly portable expectations on.Lyre
@DavidSchwartz First, that all implementations at least provide "ptrace volatile" shows that it doesn't have an extravagant cost. Second, I don't see how an implementation could provide less.Subinfeudation
@Subinfeudation It shows it doesn't have an extravagant cost on today's hardware. But if it did on future hardware, likely the platform wouldn't pay that cost. I'm not impressed by your second argument from lack of imagination since I've seen those kinds of arguments fail over and over. Lots of early Windows code made similar assumptions about what compilers would never optimize or what "happens to happen" behavior they thought was guaranteed and it caused no end of pain.Lyre
@DavidSchwartz "ptrace semantics" obviously assume that there such thing as debugging based on breakpoints. It seems fair to assume that any high quality CPU would want to provide that, at least in a simulator. C and C++ implementation usually allow you to put breakpoints even where there is no volatile access, syscall, or other strongly external function call (function call that couldn't possibly be inlined).Subinfeudation
@Subinfeudation I really hope you don't make a practice of encouraging programmers to design based on these kinds of assumptions given that it is absolutely and completely unnecessary, provides no benefit whatsoever, and has led to massive amounts of pain in the all too recent past.Lyre
@DavidSchwartz I actually do. "ptrace semantic" is the cleanest way to explain volatile and very easy to use in practice. It provides clear benefits for many purposes: testing, writing signal handlers, writing MT code with consume semantics where possible...Subinfeudation
It was not the case in C++98/03 and I think it will never be true. In C#, by contrast, volatile implies a memory fence.Revareval

Rather than explaining what volatile does, allow me to explain when you should use volatile.

  • When inside a signal handler. Because writing to a volatile variable is pretty much the only thing the standard allows you to do from within a signal handler. Since C++11 you can use std::atomic for that purpose, but only if the atomic is lock-free.
  • When dealing with setjmp according to Intel.
  • When dealing directly with hardware and you want to ensure that the compiler does not optimize your reads or writes away.

For example:

volatile int *foo = some_memory_mapped_device;
while (*foo)
    ; // wait until *foo turns false

Without the volatile specifier, the compiler is allowed to optimize the loop away completely. The volatile specifier tells the compiler that it may not assume that two subsequent reads return the same value.

Note that volatile has nothing to do with threads. The above example does not work if a different thread is writing to *foo, because there is no acquire operation involved.
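For the multithreaded case, here is a hedged sketch of what the loop would need instead; std::atomic supplies the acquire ordering that volatile lacks (the names are illustrative):

#include <atomic>

std::atomic<int> foo{1};   // shared with another thread

void waiter() {
    // load-acquire: a fresh read every iteration, plus inter-thread ordering
    while (foo.load(std::memory_order_acquire) != 0)
        ; // wait until another thread stores 0
}

void stopper() {
    foo.store(0, std::memory_order_release); // release-store: publishes prior writes
}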

In all other cases, usage of volatile should be considered non-portable and should no longer pass code review, except when dealing with pre-C++11 compilers and compiler extensions (such as MSVC's /volatile:ms switch, which is enabled by default when targeting architectures other than ARM).

Reposeful answered 10/10, 2014 at 21:44 Comment(14)
It's stricter than "may not assume that 2 subsequent reads return the same value". Even if you only read once and/or throw the value(s) away, the read has to be done.Grogan
The use in signal handlers and setjmp are the two guarantees the standard makes. On the other hand, the intent, at least at the start, was to support memory mapped IO. Which on some processors may require a fence or a membar.Pittel
@Grogan Except nobody knows what "the read" means. For example, nobody believes an actual read from memory must be done -- no compiler I know of tries to bypass CPU caches on volatile accesses.Lyre
@JamesKanze: Not so. Re signal handlers the standard says that during signal handling only volatile std::sig_atomic_t & lock-free atomic objects have defined values. But it also says that accesses to volatile objects are observable side-effects.Grogan
@DavidSchwartz: The standard specifies that the implementation must define the actual effect corresponding to the "observable behaviour" side-effect on the abstract machine of a volatile access.Grogan
@Grogan Exactly. So saying the standard says "the read has to be done" doesn't mean anything since it's the implementation that defines what "the read" is. "The standard says the implementation must do something that the implementation defines" is equivalent to saying that the standard imposes no requirements on the implementation. The implementation may, if it wishes, define "the read" in such a way that you get some specific, useful behavior on that platform. But the standard doesn't compel a useful definition, so there's no reliable, portable behavior about reads not being optimized away.Lyre
@Grogan But it also says that what constitutes an access is "implementation defined".Pittel
@DavidSchwartz Some compiler-architecture pairs map the standard-specified sequence of accesses to actual effects and working programs access volatiles to get those effects. The fact that some such pairs have no mapping or a trivial unhelpful mapping is relevant to quality of implementations but not to the point at hand.Grogan
@JamesKanze See my above comment to DavidSchwartz.Grogan
What about MMIO registers? I can't see them working properly without volatile.Justajustemilieu
@KefSchecter MMIO registers are an entirely platform-specific thing. The platform may specify that volatile has semantics that are useful for them. Or it may provide some other non-standard means to access them.Lyre
@DavidSchwartz "no compiler I know of tries to bypass CPU caches" Which CPU doesn't guarantees that for IO mappings?Subinfeudation
@DavidSchwartz "Except nobody knows what "the read" means" But everybody would at least agrees that the CPU must check that the address "read" is valid and you have the permission to read it.Subinfeudation
@Subinfeudation Most CPUs have CPU specific ways to identify IO mapping. x86 CPUs, for example, have MTRRs that identify IO mappings for special treatment that other addresses don't have. So nothing about IO mappings applies to anything but IO mappings. Actually, C++ doesn't require the CPU to check that the read is valid since invalid reads are UB, so you shouldn't, from a C++ PoV, assume the check is done.Lyre

Does the C++ volatile keyword introduce a memory fence?

A C++ compiler which conforms to the specification is not required to introduce a memory fence. Your particular compiler might; direct your question to the authors of your compiler.

The function of "volatile" in C++ has nothing to do with threading. Remember, the purpose of "volatile" is to disable compiler optimizations so that reading from a register that is changing due to exogenous conditions is not optimized away. Is a memory address that is being written to by a different thread on a different CPU a register that is changing due to exogenous conditions? No. Again, if some compiler authors have chosen to treat memory addresses being written to by different threads on different CPUs as though they were registers changing due to exogenous conditions, that's their business; they are not required to do so. Nor are they required -- even if it does introduce a memory fence -- to, for instance, ensure that every thread sees a consistent ordering of volatile reads and writes.

In fact, volatile is pretty much useless for threading in C/C++. Best practice is to avoid it.

Moreover: memory fences are an implementation detail of particular processor architectures. In C#, where volatile explicitly is designed for multithreading, the specification does not say that half fences will be introduced, because the program might be running on an architecture that doesn't have fences in the first place. Rather, again, the specification makes certain (extremely weak) guarantees about what optimizations will be eschewed by the compiler, runtime and CPU to put certain (extremely weak) constraints on how some side effects will be ordered. In practice these optimizations are eliminated by use of half fences, but that's an implementation detail subject to change in the future.

The fact that you care about the semantics of volatile in any language as they pertain to multithreading indicates that you're thinking about sharing memory across threads. Consider simply not doing that. It makes your program far harder to understand and far more likely to contain subtle, impossible-to-reproduce bugs.

Verile answered 11/10, 2014 at 13:39 Comment(9)
"volatile is pretty much useless in C/C++." Not at all! You have a very usermode-desktop-centric view of the world... but most C and C++ code runs on embedded systems where volatile is very much needed for memory-mapped I/O.Spiv
And the reason that volatile access are preserved isn't simply because exogenous conditions can change memory locations. The very access itself can trigger further actions. For example, it's very common for a read to advance a FIFO, or clear an interrupt flag.Spiv
@BenVoigt: Useless for effectively dealing with threading woes was my intended meaning.Verile
In fact, volatile is pretty much useless in C/C++. Best practice is to avoid it. is a bad hint! volatile is useful in any read/write/modify access if the underlying hardware access must be done without optimization. This has nothing to do with threading but with optimization.Nardoo
@Klaus: Useless for effectively dealing with threading woes was my intended meaning. And you have correctly re-stated the first two sentences of my second paragraph, so we agree on that.Verile
Thank you for clarifying it with your last edit. In fact, embedded programming can't be done without volatile at all. After the introduction of C++11/14, C++ is now really well suited for very small embedded systems. And here volatile is simply a must-have.Nardoo
@Nardoo But that's because of how those systems happen to implement volatile, not because of the behavior of volatile specified by the C++ standard. These are platform-specific, non-standard behaviors of volatile.Lyre
@DavidSchwartz The standard obviously can't guarantee how memory mapped IO works. But memory mapped IO is why volatile was introduced into the C standard. Still, because the standard can't specify things like what actually happens at an "access", it says that "What constitutes an access to an object that has volatile-qualified type is implementation-defined." Far too many implementations today don't provide a useful definition of an access, which IMHO violates the spirit of the standard, even if it conforms to the letter.Pittel
That edit is a definite improvement, but your explanation is still too focused on the "memory might be changed exogenously". volatile semantics are stronger than that, the compiler has to generate every requested access (1.9/8, 1.9/12), not simply guarantee that exogenous changes are eventually detected (1.10/27). In the world of memory-mapped I/O, a memory read can have arbitrary associated logic, like a property getter. You wouldn't optimize calls to property getters according to the rules you've stated for volatile, nor does the Standard allow it.Spiv

What David is overlooking is the fact that the C++ standard specifies the behavior of several threads interacting only in specific situations, and everything else results in undefined behavior. A data race involving at least one write is undefined behavior if you don't use atomic variables.

Consequently, the compiler is perfectly within its rights to forgo any synchronization instructions, since your CPU will only notice the difference in a program that exhibits undefined behavior due to missing synchronization.
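To make that concrete, a minimal sketch (the names are hypothetical): the volatile write below is still a data race, and therefore undefined behavior, if another thread touches the variable concurrently; the atomic write is well defined:

#include <atomic>

volatile int racy = 0;      // volatile does not prevent the data race
std::atomic<int> safe{0};   // atomic accesses never constitute a data race

// Undefined behavior if another thread reads or writes 'racy' concurrently;
// the compiler is entitled to emit no synchronization whatsoever.
void write_racy() { racy = 1; }

// Well defined: a concurrent reader sees either 0 or 1.
void write_safe() { safe.store(1); }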

Fresher answered 10/10, 2014 at 22:36 Comment(5)
Nicely explained, thank you. The standard only defines the sequence of accesses to volatiles as observable as long as the program has no undefined behaviour.Karren
If the program has a data race then the standard makes no requirements on observable behaviour of the program. The compiler isn't expected to add barriers to volatile accesses in order to prevent data races present in the program, that's the programmer's job, either by using explicit barriers or atomic operations.Karren
Why do you think I'm overlooking that? What part of my argument do you think that invalidates? I 100% agree that the compiler is perfectly within its rights to forgo any synchronization.Lyre
This is simply wrong, or at least, it ignores the essential. volatile has nothing to do with threads; its original purpose was to support memory mapped IO. And at least on some processors, supporting memory mapped IO would require fences. (Compilers don't do this, but that's a different issue.)Pittel
@JamesKanze volatile has a lot to do with threads: volatile deals with memory that can be accessed without the compiler knowing that it can be accessed, and that covers many real world uses of shared data between threads on specific CPU.Subinfeudation

First of all, the C++ standard does not guarantee the memory barriers needed for properly ordering reads/writes that are non-atomic. volatile variables are recommended for use with MMIO, signal handling, etc. On most implementations volatile is not useful for multi-threading, and it's not generally recommended.

Regarding the implementation of volatile accesses, this is the compiler's choice.

This article, describing gcc behavior, shows that you cannot use a volatile object as a memory barrier to order a sequence of writes to volatile memory.

Regarding icc behavior, I found this source also stating that volatile does not guarantee ordering of memory accesses.

The Microsoft VS2013 compiler has different behavior. This documentation explains how volatile enforces Release/Acquire semantics and enables volatile objects to be used in locks/releases in multi-threaded applications.
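As a hedged illustration of what that documentation describes (this handoff is only valid under MSVC's /volatile:ms mode; standard C++ makes no such guarantee):

int payload = 0;
volatile bool ready = false;

void producer() {
    payload = 42;
    ready = true;   // under /volatile:ms the volatile write has release semantics
}

void consumer() {
    while (!ready)  // under /volatile:ms the volatile read has acquire semantics
        ;
    // Under /volatile:ms, payload is guaranteed to be 42 here; in standard
    // C++ this is a data race and therefore undefined behavior.
}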

Another aspect that needs to be taken into consideration is that the same compiler may have different behavior with respect to volatile depending on the targeted hardware architecture. This post regarding the MSVS 2013 compiler clearly states the specifics of compiling with volatile for ARM platforms.

So my answer to:

Does the C++ volatile keyword introduce a memory fence?

would be: it's not guaranteed, and probably not, but some compilers might do it. You should not rely on it.

Cobwebby answered 10/10, 2014 at 19:55 Comment(6)
It doesn't prevent optimization, it just prevents the compiler from altering loads and stores beyond certain constraints.Damondamour
It's not clear what you're saying. Are you saying that it happens to be the case on some unspecified compilers that volatile prevents the compiler from reordering loads/stores? Or are you saying the C++ standard requires it to do so? And if the latter, can you respond to my argument to the contrary quoted in the original question?Lyre
@DavidSchwartz The standard prevents a reordering (from any source) of accesses through a volatile lvalue. Since it leaves the definition of "access" up to the implementation, however, this doesn't buy us much if the implementation doesn't care.Pittel
I think some versions of MSC compilers did implement fence semantics for volatile, but there is no fence in the generated code from the compiler in Visual Studio 2012.Pittel
@JamesKanze Which basically means that the only portable behavior of volatile is that specifically enumerated by the standard. (setjmp, signals, and so on.)Lyre
@DavidSchwartz Yes. But then, it was never the intent of the committee that volatile have any portable semantics. It was introduced to support memory mapped IO, and memory mapped IO isn't portable.Pittel

The compiler only inserts a memory fence on the Itanium architecture, as far as I know.

The volatile keyword is really best used for asynchronous changes, e.g., signal handlers and memory-mapped registers; it is usually the wrong tool to use for multithreaded programming.

Damondamour answered 10/10, 2014 at 19:55 Comment(2)
Sort of. 'the compiler' (msvc) inserts a memory fence when an architecture other than ARM is targeted and the /volatile:ms switch is used (the default). See msdn.microsoft.com/en-us/library/12a04hfd.aspx. Other compilers do not insert fences on volatile variables to my knowledge. The usage of volatile should be avoided unless dealing directly with hardware, signal handlers or non-c++11 conforming compilers.Reposeful
@Reposeful No. volatile is extremely useful for many uses that don't ever deal with hardware. Whenever you want the implementation to generate CPU code that follows C/C++ code closely, use volatile.Subinfeudation

It depends on which compiler "the compiler" is. Visual C++ does, since 2005. But the Standard does not require it, so some other compilers do not.

Spiv answered 10/10, 2014 at 20:3 Comment(7)
VC++ 2012 doesn't seem to insert a fence: int volatile i; int main() { return i; } generates a main with exactly two instructions: mov eax, i; ret 0;.Pittel
@JamesKanze: Which version, exactly? And are you using any non-default compile options? I'm relying on the documentation (first affected version) and (latest version), which definitely mention acquire and release semantics.Spiv
cl /help says version 18.00.21005.1. The directory it's in is C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC. The header on the command window says VS 2013. So with regards to the version... The only options I used were /c /O2 /Fa. (Without the /O2, it also sets up the local stack frame. But there is still no fence instruction.)Pittel
@JamesKanze: I was more interested in the architecture, e.g. "Microsoft (R) C/C++ Optimizing Compiler Version 18.00.30723 for x64" Perhaps there's no fence because x86 and x64 have fairly strong cache coherency guarantees in their memory model to begin with?Spiv
Maybe. I don't really know. The fact that I did this in main, so the compiler could see the whole program, and know that there were no other threads, or at least no other accesses to the variable before mine (so there could be no cache issues) could conceivably affect this as well, but somehow, I doubt it.Pittel
@JamesKanze: In x86 asm, every load is an acquire-load, and every store is a release-store. To provide those semantics while compiling C++, MSVC merely has to avoid compile-time reordering. So a read of a volatile variable is a compiler-barrier for reordering loads/stores, but doesn't result in extra instructions. (For the same reason, atomic_thread_fence(mo_acquire) doesn't emit any instructions on x86, just like atomic_signal_fence(mo_acquire) on every architecture. Only atomic_thread_fence(mo_seq_cst) requires an MFENCE.)Speedball
IIRC, I've read that some versions of MSVC emit mov eax, [mem] / MFENCE for x.load(memory_order_seq_cst), but other compilers only emit MFENCE with seq_cst stores. (Because x86 allows StoreLoad reordering, but not LoadStore reordering). For example, have a look at what gcc does for x86 or ARM, on the Godbolt compiler explorer.Speedball

It doesn't have to. Volatile is not a synchronization primitive. It just disables optimisations, i.e. you get a predictable sequence of reads and writes within a thread, in the same order as prescribed by the abstract machine. But reads and writes in different threads have no order in the first place; it makes no sense to speak of preserving or not preserving their order. The order between threads can be established by synchronization primitives; you get UB without them.

A bit of explanation regarding memory barriers. A typical CPU has several levels of memory access. There is a memory pipeline, several levels of cache, then RAM etc.

Membar instructions flush the pipeline. They don't change the order in which reads and writes are executed, it just forces outstanding ones to be executed at a given moment. It is useful for multithreaded programs, but not much otherwise.

Cache(s) are normally automatically coherent between CPUs. If one wants to make sure the cache is in sync with RAM, a cache flush is needed. It is very different from a membar.
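In portable C++ terms, such barriers are expressed with explicit fences rather than volatile; a hedged sketch:

#include <atomic>

std::atomic<bool> flag{false};
int data = 0;                 // ordinary, non-atomic data

void producer() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release); // the "membar" before publishing
    flag.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!flag.load(std::memory_order_relaxed))
        ;
    std::atomic_thread_fence(std::memory_order_acquire); // the "membar" after observing
    // data == 42 is guaranteed here
}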

Apoplexy answered 10/10, 2014 at 19:51 Comment(8)
"It just disables optimisations" Are you saying that's all it happens to do on some unspecified compilers or platforms? Or are you saying that's all the C++ standard requires it to do. If the former, I don't see how useful that observation is. If the latter, I'd like to hear your response to my argument to the contrary cited in the original quesiton.Lyre
The latter. I don't see how your argument applies. It is about reordering, but there's no order between threads to begin with. If you are talking about a single thread, it is equally invalid because what constitutes access to a volatile variable is implementation defined and the definition need not involve a particular kind of physical memory. RAM is no better than cache or swap.Westberry
So you're saying the C++ standard says that volatile just disables compiler optimizations? That doesn't make any sense. Any optimization the compiler can do can, at least in principle, equally well be done by the CPU. So if the standard said it just disabled compiler optimizations, that would mean it would provide no behavior one could rely on in portable code at all. But that's obviously not true because portable code can rely on its behavior with respect to setjmp and signals.Lyre
@DavidSchwartz No, the standard says no such thing. Disabling optimisations is just what is commonly done to implement the standard. The standard requires that observable behaviour happens in the same order as required by the abstract machine. When the abstract machine does not require any order, the implementation is free to use any order or no order at all. Access to volatile variables in different threads are not ordered unless additional synchronization is applied.Westberry
So when you said, "the latter", you meant the former? If it just happens to disable optimizations on some platforms, then you aren't guaranteed to get a predictable sequence of reads and writes, you just might happen to on some platforms.Lyre
@DavidSchwartz I apologise for imprecise wording. The standard does not require that optimisations are disabled. It has no notion of optimisation at all. Rather, it specifies behaviour that in practice requires compilers to disable certain optimisations in such a way that the observable sequence of reads and writes is compliant with the standard.Westberry
Except it doesn't require that, because the standard permits implementations to define "observable sequence of reads and writes" however they want. If implementations choose to define observable sequences such that optimizations have to be disabled, then they do. If not, then not. You get a predictable sequence of reads and writes if, and only if, the implementation has chosen to give it to you.Lyre
No, the implementation needs to define what constitutes a single access. The sequence of such accesses is prescribed by the abstract machine. An implementation has to preserve the order. The standard explicitly says that "volatile is a hint to the implementation to avoid aggressive optimization involving the object", albeit in a non-normative part, but the intent is clear.Westberry

This is largely from memory, and based on pre-C++11, without threads. But having participated in discussions on threading in the committee, I can say that there was never an intent by the committee that volatile could be used for synchronization between threads. Microsoft proposed it, but the proposal didn't carry.

The key specification of volatile is that access to a volatile represents an "observable behavior", just like IO. In the same way the compiler cannot reorder or remove specific IO, it cannot reorder or remove accesses to a volatile object (or more correctly, accesses through an lvalue expression with volatile qualified type). The original intent of volatile was, in fact, to support memory mapped IO. The "problem" with this, however, is that it is implementation defined what constitutes a "volatile access". And many compilers implement it as if the definition was "an instruction which reads or writes to memory has been executed". Which is a legal, albeit useless definition, if the implementation specifies it. (I've yet to find the actual specification for any compiler.)

Arguably (and it's an argument I accept), this violates the intent of the standard, since unless the hardware recognizes the addresses as memory mapped IO, and inhibits any reordering, etc., you can't even use volatile for memory mapped IO, at least on Sparc or Intel architectures. Nevertheless, none of the compilers I've looked at (Sun CC, g++ and MSC) output any fence or membar instructions. (About the time Microsoft proposed extending the rules for volatile, I think some of their compilers implemented their proposal, and did emit fence instructions for volatile accesses. I've not verified what recent compilers do, but it wouldn't surprise me if it depended on some compiler option. The version I checked—I think it was VS6.0—didn't emit fences, however.)

Pittel answered 12/10, 2014 at 11:30 Comment(10)
Why do you just say the compiler cannot reorder or remove accesses to volatile objects? Surely if the accesses are observable behavior, it is precisely as important to prevent the CPU, write-posting buffers, the memory controller, and everything else from reordering them as well.Lyre
@DavidSchwartz Because that's what the standard says. Certainly, from a practical point of view, what the compilers I've verified do is totally useless, but the standard weasel-words this enough so that they can still claim conformance (or could, if they actually documented it).Pittel
@DavidSchwartz: For exclusive (or mutex'd) memory-mapped I/O to peripherals, the volatile semantics are perfectly adequate. Generally such peripherals report their memory areas as non-cacheable, which helps with reordering at the hardware level.Spiv
@BenVoigt I wondered somehow about that: the idea that the processor somehow "knows" that the address it is dealing with is memory mapped IO. As far as I know, Sparcs don't have any support for this, so that would still make Sun CC and g++ on a Sparc unusable for memory mapped IO. (When I looked into this, I was mainly interested in a Sparc.)Pittel
@JamesKanze: From what little searching I did, it looks like Sparc has dedicated address ranges for "alternate views" of memory that are noncacheable. As long as your volatile access points into the ASI_REAL_IO portion of the address space, I think you should be ok. (Altera NIOS uses a similar technique, with high bits of the address controlling MMU bypass; I'm sure there are others too)Spiv
@JamesKanze That's not what the standard says. The standard doesn't talk about what the compiler can or cannot do at all. It talks about what the generated code can or cannot do. It would be incoherent for the standard to say "the compiler cannot do X, but it can generate code that does X".Lyre
@BenVoigt It could be. I didn't find anything about them at the time I was looking (maybe 10 years ago), but I didn't look very far either. Even if the Sparc architecture doesn't define such, of course, an individual chip set implementing it could.Pittel
@DavidSchwartz The standard says "what constitutes an access is implementation defined". If the implementation defines it as the execution of a load or a store instruction, then that's how the implementation defines it. Even if this has no visible effect outside the processor. In this case, practically speaking, the standard allows the implementation to define the "observable behavior" to be something that you cannot actually observe.Pittel
@BenVoigt Concerning noncacheable views: the cache isn't the problem; it's the pipeline in the CPU itself. Any configuration would have to affect the CPU for volatile to be effective without a membar.Pittel
@JamesKanze "the standard allows the implementation to define the "observable behavior" to be something that you cannot actually observe" But you could on a CPU emulator!Subinfeudation

The compiler needs to introduce a memory fence around volatile accesses if, and only if, that is necessary to make the uses for volatile specified in the standard work (setjmp, signal handlers, and so on) on that particular platform.

Note that some compilers do go way beyond what's required by the C++ standard in order to make volatile more powerful or useful on those platforms. Portable code shouldn't rely on volatile to do anything beyond what's specified in the C++ standard.
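For reference, a minimal sketch of the portable, standard-specified use mentioned above: a signal handler communicating with the main loop through a volatile std::sig_atomic_t:

#include <csignal>

volatile std::sig_atomic_t got_signal = 0;

void handler(int) {
    got_signal = 1;   // one of the few things a handler may portably do
}

int main() {
    std::signal(SIGINT, handler);
    while (!got_signal) {
        // do work; the volatile read cannot be hoisted out of the loop
    }
}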

Lyre answered 13/10, 2014 at 0:27 Comment(1)
In 2017, compilers do not emit memory fences around volatile.Revareval

I always use volatile in interrupt service routines, e.g. the ISR (often assembly code) modifies some memory location and the higher level code that runs outside of the interrupt context accesses the memory location through a pointer to volatile.

I do this for RAM as well as memory-mapped IO.
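A minimal sketch of that pattern (the ISR name, and the assumption that it is wired to a hardware interrupt, are illustrative):

volatile bool rx_complete = false;   // written by the ISR, read by the main loop

void uart_rx_isr() {                 // assumed registered as an interrupt handler
    rx_complete = true;
}

int main() {
    for (;;) {
        if (rx_complete) {           // volatile forces a real load on every pass
            rx_complete = false;
            // handle the received data
        }
    }
}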

Based on the discussion here it seems this is still a valid use of volatile but doesn't have anything to do with multiple threads or CPUs. If the compiler for a microcontroller "knows" that there can't be any other accesses (e.g. everything is on-chip, no cache and there's only one core) I would think that a memory fence isn't implied at all, the compiler just needs to prevent certain optimisations.

As we pile more stuff into the "system" that executes the object code, almost all bets are off; at least that's how I read this discussion. How could a compiler ever cover all bases?

Rainout answered 14/10, 2014 at 22:19 Comment(0)

I think the confusion around volatile and instruction reordering stems from the two notions of reordering CPUs perform:

  1. Out-of-order execution.
  2. Sequence of memory read/writes as seen by other CPUs (reordering in a sense that each CPU might see a different sequence).

Volatile affects how a compiler generates code assuming single-threaded execution (this includes interrupts). It doesn't imply anything about memory barrier instructions; rather, it precludes a compiler from performing certain kinds of optimizations related to memory accesses.
A typical example is re-fetching a value from memory instead of using one cached in a register.
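A hedged sketch of that example: with a plain int the compiler may hoist the load out of the loop and spin on a register forever; volatile forces a fresh load on every iteration:

extern volatile int flag;   // set elsewhere, e.g. by an interrupt handler

void spin_until_set() {
    while (flag == 0) {
        // busy-wait; each test performs a real load of 'flag',
        // but no memory barrier is implied, only the re-fetch
    }
}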

Out-of-order execution

CPUs can execute instructions out-of-order/speculatively provided that the end result could have happened in the original code. CPUs can perform transformations that are disallowed in compilers because compilers can only perform transformations which are correct in all circumstances. In contrast, CPUs can check the validity of these optimizations and back out of them if they turn out to be incorrect.

Sequence of memory read/writes as seen by other CPUs

The end result of a sequence of instructions, the effective order, must agree with the semantics of the code generated by a compiler. However, the actual execution order chosen by the CPU can be different. The effective order as seen by other CPUs (every CPU can have a different view) can be constrained by memory barriers.
I'm not sure how much the effective and actual orders can differ, because I don't know to what extent memory barriers can preclude CPUs from performing out-of-order execution.


Burn answered 19/11, 2017 at 0:38 Comment(0)

While I was working through an online downloadable video tutorial for 3D graphics and game engine development with modern OpenGL, we used volatile within one of our classes. The tutorial website can be found here and the video working with the volatile keyword is found in the Shader Engine series, video 98. These works are not my own but are accredited to Marek A. Krzeminski, MASc, and this is an excerpt from the video download page.

"Since we can now have our games run in multiple threads it is important to synchronize data between threads properly. In this video I show how to create a volitile locking class to ensure volitile variables are properly synchronized..."

And if you are subscribed to his website and have access to his videos, within this video he references this article concerning the use of volatile in multithreaded programming.

Here is the article from the link above: http://www.drdobbs.com/cpp/volatile-the-multithreaded-programmers-b/184403766

volatile: The Multithreaded Programmer's Best Friend

By Andrei Alexandrescu, February 01, 2001

The volatile keyword was devised to prevent compiler optimizations that might render code incorrect in the presence of certain asynchronous events.

I don't want to spoil your mood, but this column addresses the dreaded topic of multithreaded programming. If — as the previous installment of Generic says — exception-safe programming is hard, it's child's play compared to multithreaded programming.

Programs using multiple threads are notoriously hard to write, prove correct, debug, maintain, and tame in general. Incorrect multithreaded programs might run for years without a glitch, only to unexpectedly run amok because some critical timing condition has been met.

Needless to say, a programmer writing multithreaded code needs all the help she can get. This column focuses on race conditions — a common source of trouble in multithreaded programs — and provides you with insights and tools on how to avoid them and, amazingly enough, have the compiler work hard at helping you with that.

Just a Little Keyword

Although both C and C++ Standards are conspicuously silent when it comes to threads, they do make a little concession to multithreading, in the form of the volatile keyword.

Just like its better-known counterpart const, volatile is a type modifier. It's intended to be used in conjunction with variables that are accessed and modified in different threads. Basically, without volatile, either writing multithreaded programs becomes impossible, or the compiler wastes vast optimization opportunities. An explanation is in order.

Consider the following code:

class Gadget {
public:
    void Wait() {
        while (!flag_) {
            Sleep(1000); // sleeps for 1000 milliseconds
        }
    }
    void Wakeup() {
        flag_ = true;
    }
    ...
private:
    bool flag_;
};

The purpose of Gadget::Wait above is to check the flag_ member variable every second and return when that variable has been set to true by another thread. At least that's what its programmer intended, but, alas, Wait is incorrect.

Suppose the compiler figures out that Sleep(1000) is a call into an external library that cannot possibly modify the member variable flag_. Then the compiler concludes that it can cache flag_ in a register and use that register instead of accessing the slower on-board memory. This is an excellent optimization for single-threaded code, but in this case, it harms correctness: after you call Wait for some Gadget object, although another thread calls Wakeup, Wait will loop forever. This is because the change of flag_ will not be reflected in the register that caches flag_. The optimization is too ... optimistic.

Caching variables in registers is a very valuable optimization that applies most of the time, so it would be a pity to waste it. C and C++ give you the chance to explicitly disable such caching. If you use the volatile modifier on a variable, the compiler won't cache that variable in registers — each access will hit the actual memory location of that variable. So all you have to do to make Gadget's Wait/Wakeup combo work is to qualify flag_ appropriately:

class Gadget {
public:
    ... as above ...
private:
    volatile bool flag_;
};

Most explanations of the rationale and usage of volatile stop here and advise you to volatile-qualify the primitive types that you use in multiple threads. However, there is much more you can do with volatile, because it is part of C++'s wonderful type system.

Using volatile with User-Defined Types

You can volatile-qualify not only primitive types, but also user-defined types. In that case, volatile modifies the type in a way similar to const. (You can also apply const and volatile to the same type simultaneously.)

Unlike const, volatile discriminates between primitive types and user-defined types. Namely, unlike classes, primitive types still support all of their operations (addition, multiplication, assignment, etc.) when volatile-qualified. For example, you can assign a non-volatile int to a volatile int, but you cannot assign a non-volatile object to a volatile object.

Let's illustrate how volatile works on user-defined types with an example.

class Gadget {
public:
    void Foo() volatile;
    void Bar();
    ...
private:
    String name_;
    int state_;
};
...
Gadget regularGadget;
volatile Gadget volatileGadget;

If you think volatile is not that useful with objects, prepare for some surprise.

volatileGadget.Foo(); // ok, volatile fun called for
                      // volatile object
regularGadget.Foo();  // ok, volatile fun called for
                      // non-volatile object
volatileGadget.Bar(); // error! Non-volatile function called for
                      // volatile object!

The conversion from a non-qualified type to its volatile counterpart is trivial. However, just as with const, you cannot make the trip back from volatile to non-qualified. You must use a cast:

Gadget& ref = const_cast<Gadget&>(volatileGadget);
ref.Bar(); // ok

A volatile-qualified class gives access only to a subset of its interface, a subset that is under the control of the class implementer. Users can gain full access to that type's interface only by using a const_cast. In addition, just like constness, volatileness propagates from the class to its members (for example, volatileGadget.name_ and volatileGadget.state_ are volatile variables).

volatile, Critical Sections, and Race Conditions

The simplest and the most often-used synchronization device in multithreaded programs is the mutex. A mutex exposes the Acquire and Release primitives. Once you call Acquire in some thread, any other thread calling Acquire will block. Later, when that thread calls Release, precisely one thread blocked in an Acquire call will be released. In other words, for a given mutex, only one thread can get processor time in between a call to Acquire and a call to Release. The executing code between a call to Acquire and a call to Release is called a critical section. (Windows terminology is a bit confusing because it calls the mutex itself a critical section, while "mutex" is actually an inter-process mutex. It would have been nice if they were called thread mutex and process mutex.)

Mutexes are used to protect data against race conditions. By definition, a race condition occurs when the effect of more threads on data depends on how threads are scheduled. Race conditions appear when two or more threads compete for using the same data. Because threads can interrupt each other at arbitrary moments in time, data can be corrupted or misinterpreted. Consequently, changes and sometimes accesses to data must be carefully protected with critical sections. In object-oriented programming, this usually means that you store a mutex in a class as a member variable and use it whenever you access that class' state.

Experienced multithreaded programmers might have yawned reading the two paragraphs above, but their purpose is to provide an intellectual workout, because now we will link with the volatile connection. We do this by drawing a parallel between the C++ types' world and the threading semantics world.

  • Outside a critical section, any thread might interrupt any other at any time; there is no control, so consequently variables accessible from multiple threads are volatile. This is in keeping with the original intent of volatile — that of preventing the compiler from unwittingly caching values used by multiple threads at once.
  • Inside a critical section defined by a mutex, only one thread has access. Consequently, inside a critical section, the executing code has single-threaded semantics. The controlled variable is not volatile anymore — you can remove the volatile qualifier.

In short, data shared between threads is conceptually volatile outside a critical section, and non-volatile inside a critical section.

You enter a critical section by locking a mutex. You remove the volatile qualifier from a type by applying a const_cast. If we manage to put these two operations together, we create a connection between C++'s type system and an application's threading semantics. We can make the compiler check race conditions for us.

LockingPtr

We need a tool that collects a mutex acquisition and a const_cast. Let's develop a LockingPtr class template that you initialize with a volatile object obj and a mutex mtx. During its lifetime, a LockingPtr keeps mtx acquired. Also, LockingPtr offers access to the volatile-stripped obj. The access is offered in a smart pointer fashion, through operator-> and operator*. The const_cast is performed inside LockingPtr. The cast is semantically valid because LockingPtr keeps the mutex acquired for its lifetime.

First, let's define the skeleton of a class Mutex with which LockingPtr will work:

class Mutex {
public:
    void Acquire();
    void Release();
    ...    
};

To use LockingPtr, you implement Mutex using your operating system's native data structures and primitive functions.

LockingPtr is templated with the type of the controlled variable. For example, if you want to control a Widget, you use a LockingPtr that you initialize with a variable of type volatile Widget.

LockingPtr's definition is very simple. LockingPtr implements an unsophisticated smart pointer. It focuses solely on collecting a const_cast and a critical section.

template <typename T>
class LockingPtr {
public:
    // Constructors/destructors
    LockingPtr(volatile T& obj, Mutex& mtx)
      : pObj_(const_cast<T*>(&obj)), pMtx_(&mtx) {    
        mtx.Acquire();
    }
    ~LockingPtr() {    
        pMtx_->Release();
    }
    // Pointer behavior
    T& operator*() {    
        return *pObj_;    
    }
    T* operator->() {   
        return pObj_;   
    }
private:
    T* pObj_;
    Mutex* pMtx_;
    LockingPtr(const LockingPtr&);
    LockingPtr& operator=(const LockingPtr&);
};

In spite of its simplicity, LockingPtr is a very useful aid in writing correct multithreaded code. You should define objects that are shared between threads as volatile and never use const_cast with them — always use LockingPtr automatic objects. Let's illustrate this with an example.

Say you have two threads that share a vector object:

class SyncBuf {
public:
    void Thread1();
    void Thread2();
private:
    typedef vector<char> BufT;
    volatile BufT buffer_;
    Mutex mtx_; // controls access to buffer_
};

Inside a thread function, you simply use a LockingPtr to get controlled access to the buffer_ member variable:

void SyncBuf::Thread1() {
    LockingPtr<BufT> lpBuf(buffer_, mtx_);
    BufT::iterator i = lpBuf->begin();
    for (; i != lpBuf->end(); ++i) {
        ... use *i ...
    }
}

The code is very easy to write and understand — whenever you need to use buffer_, you must create a LockingPtr pointing to it. Once you do that, you have access to vector's entire interface.

The nice part is that if you make a mistake, the compiler will point it out:

void SyncBuf::Thread2() {
    // Error! Cannot access 'begin' for a volatile object
    BufT::iterator i = buffer_.begin();
    // Error! Cannot access 'end' for a volatile object
    for ( ; i != buffer_.end(); ++i ) {
        ... use *i ...
    }
}

You cannot access any function of buffer_ until you either apply a const_cast or use LockingPtr. The difference is that LockingPtr offers an ordered way of applying const_cast to volatile variables.

LockingPtr is remarkably expressive. If you only need to call one function, you can create an unnamed temporary LockingPtr object and use it directly:

unsigned int SyncBuf::Size() {
    return LockingPtr<BufT>(buffer_, mtx_)->size();
}

Back to Primitive Types

We saw how nicely volatile protects objects against uncontrolled access and how LockingPtr provides a simple and effective way of writing thread-safe code. Let's now return to primitive types, which are treated differently by volatile.

Let's consider an example where multiple threads share a variable of type int.

class Counter {
public:
    ...
    void Increment() { ++ctr_; }
    void Decrement() { --ctr_; }
private:
    int ctr_;
};

If Increment and Decrement are to be called from different threads, the fragment above is buggy. First, ctr_ must be volatile. Second, even a seemingly atomic operation such as ++ctr_ is actually a three-stage operation. Memory itself has no arithmetic capabilities. When incrementing a variable, the processor:

  • Reads that variable into a register
  • Increments the value in the register
  • Writes the result back to memory

This three-step operation is called RMW (Read-Modify-Write). During the Modify part of an RMW operation, most processors free the memory bus in order to give other processors access to the memory.

If at that time another processor performs an RMW operation on the same variable, we have a race condition: the second write overwrites the effect of the first.

To avoid that, you can rely, again, on LockingPtr:

class Counter {
public:
    ...
    void Increment() { ++*LockingPtr<int>(ctr_, mtx_); }
    void Decrement() { --*LockingPtr<int>(ctr_, mtx_); }
private:
    volatile int ctr_;
    Mutex mtx_;
};

Now the code is correct, but its quality is inferior when compared to SyncBuf's code. Why? Because with Counter, the compiler will not warn you if you mistakenly access ctr_ directly (without locking it). The compiler compiles ++ctr_ if ctr_ is volatile, although the generated code is simply incorrect. The compiler is not your ally anymore, and only your attention can help you avoid race conditions.

What should you do then? Simply encapsulate the primitive data that you use in higher-level structures and use volatile with those structures. Paradoxically, it's worse to use volatile directly with built-ins, in spite of the fact that initially this was the usage intent of volatile!

volatile Member Functions

So far, we've had classes that aggregate volatile data members; now let's think of designing classes that in turn will be part of larger objects and shared between threads. Here is where volatile member functions can be of great help.

When designing your class, you volatile-qualify only those member functions that are thread safe. You must assume that code from the outside will call the volatile functions from any code at any time. Don't forget: volatile equals free multithreaded code and no critical section; non-volatile equals single-threaded scenario or inside a critical section.

For example, you define a class Widget that implements an operation in two variants — a thread-safe one and a fast, unprotected one.

class Widget {
public:
    void Operation() volatile;
    void Operation();
    ...
private:
    Mutex mtx_;
};

Notice the use of overloading. Now Widget's user can invoke Operation using a uniform syntax either for volatile objects and get thread safety, or for regular objects and get speed. The user must be careful about defining the shared Widget objects as volatile.

When implementing a volatile member function, the first operation is usually to lock this with a LockingPtr. Then the work is done by using the non-volatile sibling:

void Widget::Operation() volatile {
    LockingPtr<Widget> lpThis(*this, mtx_);
    lpThis->Operation(); // invokes the non-volatile function
}

Summary

When writing multithreaded programs, you can use volatile to your advantage. You must stick to the following rules:

  • Define all shared objects as volatile.
  • Don't use volatile directly with primitive types.
  • When defining shared classes, use volatile member functions to express thread safety.

If you do this, and if you use the simple generic component LockingPtr, you can write thread-safe code and worry much less about race conditions, because the compiler will worry for you and will diligently point out the spots where you are wrong.

A couple of projects I've been involved with use volatile and LockingPtr to great effect. The code is clean and understandable. I recall a couple of deadlocks, but I prefer deadlocks to race conditions because they are so much easier to debug. There were virtually no problems related to race conditions. But then you never know.

Acknowledgements

Many thanks to James Kanze and Sorin Jianu who helped with insightful ideas.


Andrei Alexandrescu is a Development Manager at RealNetworks Inc. (www.realnetworks.com), based in Seattle, WA, and author of the acclaimed book Modern C++ Design. He may be contacted at www.moderncppdesign.com. Andrei is also one of the featured instructors of The C++ Seminar (www.gotw.ca/cpp_seminar).

This article might be a little dated, but it does give good insight into an excellent use of the volatile modifier in multithreaded programming: helping to keep events asynchronous while having the compiler check for race conditions for us. This may not directly answer the OP's original question about creating a memory fence, but I chose to post it as an answer for others as an excellent reference for a good use of volatile when working with multithreaded applications.

Dissyllable answered 19/11, 2017 at 17:9 Comment(0)

The keyword volatile essentially means that reads and writes of an object should be performed exactly as written by the program, and not optimized in any way. The binary code should follow the C or C++ code: a load where there is a read, a store where there is a write.

It also means that no read should be expected to result in a predictable value: the compiler shouldn't assume anything about a read even immediately following a write to the same volatile object:

volatile int i;
i = 1;
int j = i; 
if (j == 1) // not assumed to be true

volatile may be the most important tool in the "C is a high level assembly language" toolbox.

Whether declaring an object volatile is sufficient for ensuring the behavior of code that deals with asynchronous changes depends on the platform: different CPUs give different levels of guaranteed synchronization for normal memory reads and writes. You probably shouldn't try to write such low level multithreading code unless you are an expert in the area.

Atomic primitives provide a nice higher level view of objects for multithreading that makes it easy to reason about code. Almost all programmers should use either atomic primitives or primitives that provide mutual exclusion, like mutexes, read-write locks, semaphores, or other blocking primitives.
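For example, a hedged sketch of the atomic alternative to the volatile counter discussed elsewhere on this page:

#include <atomic>

std::atomic<int> counter{0};

void increment() {
    // an atomic read-modify-write: no lock needed, no data race
    counter.fetch_add(1, std::memory_order_relaxed);
}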

Subinfeudation answered 15/6, 2018 at 3:36 Comment(0)
